## <font color='blue'> Spark: Creating Estimators and Transformers </font>

The goal is to show, with a simple example, how to create your own estimators and transformers in Scala.

1. The estimator and the transformer are first written in Scala.
2. The classes are then packaged into a JAR containing the classes and their dependencies (via the SBT assembly plugin).
3. The JAR file must be accessible to Spark (for example via Spark's classpath).
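
The packaging step can be sketched as the following SBT configuration. This is only an illustrative setup: the project name, JAR name, and all version numbers are assumptions and should be aligned with the Spark and Scala versions of your own installation.

```scala
// --- project/plugins.sbt ---
// Enables the `assembly` task (plugin version is an assumption).
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

// --- build.sbt ---
name := "custom-spark-ml"   // hypothetical project name
version := "0.1.0"
scalaVersion := "2.12.18"   // must match the Scala version of your Spark build

// "provided": Spark is already on the runtime classpath,
// so it is compiled against but left out of the assembled JAR.
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "3.5.0" % "provided"

// Hypothetical name for the assembled JAR.
assembly / assemblyJarName := "custom-spark-ml-assembly.jar"
```

Running `sbt assembly` then produces the JAR under `target/`. One common way to make it visible to Spark is the `--jars` option of `spark-shell` or `spark-submit`; how the JAR is registered in a notebook depends on the kernel in use.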

The code is [here](src/main/scala).

In [2]:
import spark.implicits._

val simpleData = Seq(("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  )

val df = simpleData.toDF("employee_name", "department", "salary")
df.show()

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
|        James|     Sales|  3000|
|        Scott|   Finance|  3300|
|          Jen|   Finance|  3900|
|         Jeff| Marketing|  3000|
|        Kumar| Marketing|  2000|
|         Saif|     Sales|  4100|
+-------------+----------+------+



import spark.implicits._
simpleData: Seq[(String, String, Int)] = List((James,Sales,3000), (Michael,Sales,4600), (Robert,Sales,4100), (Maria,Finance,3000), (James,Sales,3000), (Scott,Finance,3300), (Jen,Finance,3900), (Jeff,Marketing,3000), (Kumar,Marketing,2000), (Saif,Sales,4100))
df: org.apache.spark.sql.DataFrame = [employee_name: string, department: string ... 1 more field]


**Estimator that replaces the levels of a categorical variable with integer indices**

In [3]:
import feature.CustomStringIndexer
val indexer = new CustomStringIndexer()
                  .setInputCol("department")
                  .setOutputCol("departmentIndex")

indexer.fit(df).transform(df).show()

+-------------+----------+------+---------------+
|employee_name|department|salary|departmentIndex|
+-------------+----------+------+---------------+
|        James|     Sales|  3000|            2.0|
|      Michael|     Sales|  4600|            2.0|
|       Robert|     Sales|  4100|            2.0|
|        Maria|   Finance|  3000|            0.0|
|        James|     Sales|  3000|            2.0|
|        Scott|   Finance|  3300|            0.0|
|          Jen|   Finance|  3900|            0.0|
|         Jeff| Marketing|  3000|            1.0|
|        Kumar| Marketing|  2000|            1.0|
|         Saif|     Sales|  4100|            2.0|
+-------------+----------+------+---------------+



import feature.CustomStringIndexer
indexer: feature.CustomStringIndexer = cstri_383fc9739cb5


In [4]:
// Fit the estimator once, keep the fitted model, and persist it to disk
val model = indexer.fit(df)
model.write.overwrite().save("spark-model")

model: feature.CustomStringIndexerModel = CustomStringIndexerModel: uid=cstri_383fc9739cb5


In [5]:
import feature.CustomStringIndexerModel
// Reload the fitted model and check that it yields the same indices
val modelLoaded = CustomStringIndexerModel.load("spark-model")
modelLoaded.transform(df).show()

+-------------+----------+------+---------------+
|employee_name|department|salary|departmentIndex|
+-------------+----------+------+---------------+
|        James|     Sales|  3000|            2.0|
|      Michael|     Sales|  4600|            2.0|
|       Robert|     Sales|  4100|            2.0|
|        Maria|   Finance|  3000|            0.0|
|        James|     Sales|  3000|            2.0|
|        Scott|   Finance|  3300|            0.0|
|          Jen|   Finance|  3900|            0.0|
|         Jeff| Marketing|  3000|            1.0|
|        Kumar| Marketing|  2000|            1.0|
|         Saif|     Sales|  4100|            2.0|
+-------------+----------+------+---------------+



import feature.CustomStringIndexerModel
modelLoaded: feature.CustomStringIndexerModel = CustomStringIndexerModel: uid=cstri_383fc9739cb5
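
The actual `CustomStringIndexer` lives in the linked source directory and is not reproduced in this notebook. To give a rough idea of the moving parts, here is a minimal sketch of an estimator/model pair written against the public Spark ML API. It is not the author's implementation: the names `SketchIndexerParams`, `SketchStringIndexer`, and `SketchStringIndexerModel` are hypothetical, and the alphabetical ordering of labels is an assumption made only because it is consistent with the output above (Finance 0.0, Marketing 1.0, Sales 2.0).

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.{Param, ParamMap, Params, StringArrayParam}
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Parameters shared by the estimator and its model (hypothetical names).
trait SketchIndexerParams extends Params {
  final val inputCol = new Param[String](this, "inputCol", "name of the column to index")
  final val outputCol = new Param[String](this, "outputCol", "name of the index column")

  // Check that the input column exists and append the output column to the schema.
  protected def validateAndTransformSchema(schema: StructType): StructType = {
    require(schema.fieldNames.contains($(inputCol)), s"Missing column ${$(inputCol)}")
    StructType(schema.fields :+ StructField($(outputCol), DoubleType, nullable = false))
  }
}

class SketchStringIndexer(override val uid: String)
    extends Estimator[SketchStringIndexerModel] with SketchIndexerParams with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("sketchidx"))

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // Collect the distinct non-null labels and sort them alphabetically (assumption).
  override def fit(dataset: Dataset[_]): SketchStringIndexerModel = {
    transformSchema(dataset.schema)
    val labels = dataset.select(col($(inputCol)).cast("string"))
      .na.drop().distinct().collect().map(_.getString(0)).sorted
    val model = new SketchStringIndexerModel(uid).setLabels(labels)
    copyValues(model.setParent(this))
  }

  override def transformSchema(schema: StructType): StructType = validateAndTransformSchema(schema)
  override def copy(extra: ParamMap): SketchStringIndexer = defaultCopy(extra)
}

class SketchStringIndexerModel(override val uid: String)
    extends Model[SketchStringIndexerModel] with SketchIndexerParams with DefaultParamsWritable {

  // Storing the labels as a Param lets DefaultParamsWritable persist them with the model.
  final val labels = new StringArrayParam(this, "labels", "labels in index order")
  def setLabels(value: Array[String]): this.type = set(labels, value)

  // Map each label to its position; unseen or null labels make the UDF fail.
  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema)
    val index = $(labels).zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap
    val toIndex = udf { (label: String) => index(label) }
    dataset.withColumn($(outputCol), toIndex(col($(inputCol)).cast("string")))
  }

  override def transformSchema(schema: StructType): StructType = validateAndTransformSchema(schema)

  override def copy(extra: ParamMap): SketchStringIndexerModel =
    copyValues(new SketchStringIndexerModel(uid), extra).setParent(parent)
}

// Companion objects provide the `load` method used when reading the stages back.
object SketchStringIndexer extends DefaultParamsReadable[SketchStringIndexer]
object SketchStringIndexerModel extends DefaultParamsReadable[SketchStringIndexerModel]
```

In this sketch the fitted state (the label array) is held in a `Param`, so the default parameter-based persistence is enough for `write`/`load`. A model with larger fitted state would more likely need a custom `MLWriter`/`MLReader`.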


**Transformer that drops a column from a DataFrame**

In [6]:
import feature.ColumnFilter
val filter = new ColumnFilter().setDropCols("salary")
filter.transform(df).show()

+-------------+----------+
|employee_name|department|
+-------------+----------+
|        James|     Sales|
|      Michael|     Sales|
|       Robert|     Sales|
|        Maria|   Finance|
|        James|     Sales|
|        Scott|   Finance|
|          Jen|   Finance|
|         Jeff| Marketing|
|        Kumar| Marketing|
|         Saif|     Sales|
+-------------+----------+



import feature.ColumnFilter
filter: feature.ColumnFilter =
ColumnFilter: uid=colfilter_0776de12ebcd
Drop: salary


In [7]:
// Persist the transformer and its parameters to disk
filter.write.overwrite().save("spark-filter")

In [8]:
// Reload the transformer and check that it still drops the same column
val transLoad = ColumnFilter.load("spark-filter")
transLoad.transform(df).show()

+-------------+----------+
|employee_name|department|
+-------------+----------+
|        James|     Sales|
|      Michael|     Sales|
|       Robert|     Sales|
|        Maria|   Finance|
|        James|     Sales|
|        Scott|   Finance|
|          Jen|   Finance|
|         Jeff| Marketing|
|        Kumar| Marketing|
|         Saif|     Sales|
+-------------+----------+



transLoad: feature.ColumnFilter =
ColumnFilter: uid=colfilter_0776de12ebcd
Drop: salary
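
The real `ColumnFilter` is defined in the linked source directory. As a rough illustration of what a persistable transformer of this kind can look like, here is a minimal sketch. The class name `SketchColumnFilter` is hypothetical, and the assumption that `dropCols` holds a single column name is taken only from the way `setDropCols` is called above.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Hypothetical transformer that drops one column; not the author's ColumnFilter.
class SketchColumnFilter(override val uid: String)
    extends Transformer with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("sketchfilter"))

  // Name of the column to remove (assumed to be a single column).
  final val dropCols = new Param[String](this, "dropCols", "name of the column to drop")
  def setDropCols(value: String): this.type = set(dropCols, value)
  def getDropCols: String = $(dropCols)

  // Dataset.drop is a no-op if the column does not exist.
  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema)
    dataset.drop($(dropCols))
  }

  // The output schema is the input schema without the dropped column.
  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields.filterNot(_.name == $(dropCols)))

  override def copy(extra: ParamMap): SketchColumnFilter = defaultCopy(extra)
}

// The companion object supplies `load`, mirroring ColumnFilter.load used above.
object SketchColumnFilter extends DefaultParamsReadable[SketchColumnFilter]
```

Because a transformer like this has no fitted state, mixing in `DefaultParamsWritable` and `DefaultParamsReadable` is enough to support `write.overwrite().save(...)` and `load(...)`. The reader instantiates the class by name, which is one reason to compile such classes into a JAR rather than define them in a notebook cell.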


**Adding them to a pipeline**

In [9]:
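import org.apache.spark.ml.Pipeline

// Stage 1: index the "department" column
val indexer = new CustomStringIndexer()
                  .setInputCol("department")
                  .setOutputCol("departmentIndex")

// Stage 2: drop the original "department" column once it has been indexed
val filter = new ColumnFilter().setDropCols("department")

// Chain both stages; fit() trains the estimator and returns a PipelineModel
val pipeline = new Pipeline()
              .setStages(Array(indexer, filter))

val model = pipeline.fit(df)

model.transform(df).show()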

+-------------+------+---------------+
|employee_name|salary|departmentIndex|
+-------------+------+---------------+
|        James|  3000|            2.0|
|      Michael|  4600|            2.0|
|       Robert|  4100|            2.0|
|        Maria|  3000|            0.0|
|        James|  3000|            2.0|
|        Scott|  3300|            0.0|
|          Jen|  3900|            0.0|
|         Jeff|  3000|            1.0|
|        Kumar|  2000|            1.0|
|         Saif|  4100|            2.0|
+-------------+------+---------------+



import org.apache.spark.ml.Pipeline
indexer: feature.CustomStringIndexer = cstri_9bb58bd35519
filter: feature.ColumnFilter =
ColumnFilter: uid=colfilter_82bde31a7e32
Drop: department
pipeline: org.apache.spark.ml.Pipeline = pipeline_fdc2fa30ea6f
model: org.apache.spark.ml.PipelineModel = pipeline_fdc2fa30ea6f


In [10]:
model.write.overwrite().save("spark-pipeline")
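
For completeness, the saved pipeline can presumably be read back in the same way as the individual stages. The cell below is a sketch that continues the notebook session: it assumes `df` is still defined, that the pipeline was saved to `spark-pipeline` as above, and that the JAR containing the custom stages is on the classpath. The name `pipelineLoaded` is chosen here for illustration.

```scala
import org.apache.spark.ml.PipelineModel

// Reload the fitted pipeline; each stage is restored through its own reader.
val pipelineLoaded = PipelineModel.load("spark-pipeline")

// Apply the reloaded pipeline to the same DataFrame as before.
pipelineLoaded.transform(df).show()
```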

**Sources:**  
[Custom Spark ML with a Python wrapper](https://raufer.github.io/2018/02/08/custom-spark-models-with-python-wrappers/)    
[Custom transformer for Apache Spark and MLeap](https://medium.com/@m.majidpour/custom-transformer-for-apache-spark-and-mleap-66b95ad0d37b)