## Package ml-feature (org.apache.spark.ml.feature)

This is Spark's feature engineering package (data preparation and formatting). It is divided into three parts:

* **Transformation:** processing of variables: normalization, standardization, encoding, ...
* **Extraction:** extracting variables from "raw" data.
* **Selection:** selecting a subset of variables.

## Representing vectors in Spark

Dense and sparse vectors are one of the basic building blocks of Spark ML.

Dense vectors store every value, including zeros.

In [2]:
import org.apache.spark.ml.linalg.Vectors

// Create the dense vector (1.0, 1.3, 0.0, 3.0, 3.5, 5.0)
val vectDense = Vectors.dense(1, 1.3, 0.0, 3, 3.5, 5)

// Create a dense vector of dimension 10 with every component set to 0.0
val vectZeros = Vectors.zeros(10)

import org.apache.spark.ml.linalg.Vectors
vectDense: org.apache.spark.ml.linalg.Vector = [1.0,1.3,0.0,3.0,3.5,5.0]
vectZeros: org.apache.spark.ml.linalg.Vector = [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]


Sparse vectors store only the non-zero values and their indices. A sparse vector can be created in two ways:

* `Vectors.sparse(size, indices, values)`
* `Vectors.sparse(size, Seq((index, value), ...))`

In [3]:
// By giving the indices of the non-zero components (1, 3, 4) and their values
// vectSparse = (0.0, 10.5, 0.0, 3.0, 11.0)
val vectSparse1 = Vectors.sparse(5, Array(1, 3, 4), Array(10.5, 3.0, 11.0))

// By giving the sequence of (index, value) pairs for the non-zero components
val vectSparse2 = Vectors.sparse(5, Seq((1, 10.5), (3, 3.0), (4, 11.0)))

vectSparse1: org.apache.spark.ml.linalg.Vector = (5,[1,3,4],[10.5,3.0,11.0])
vectSparse2: org.apache.spark.ml.linalg.Vector = (5,[1,3,4],[10.5,3.0,11.0])
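
To make the sparse encoding concrete, here is a minimal plain-Scala sketch that expands a `(size, indices, values)` triple into the dense array it stands for. The helper `toDenseArray` is hypothetical and written only for illustration; it is not part of Spark, whose vectors provide their own `toDense` and `toSparse` methods.

```scala
// Hypothetical helper (not a Spark API): expand (size, indices, values) into a dense array.
def toDenseArray(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  val dense = Array.fill(size)(0.0)                           // every component starts at 0.0
  indices.zip(values).foreach { case (i, v) => dense(i) = v } // overwrite the non-zero slots
  dense
}

val expanded = toDenseArray(5, Array(1, 3, 4), Array(10.5, 3.0, 11.0))
println(expanded.mkString("(", ", ", ")"))  // prints (0.0, 10.5, 0.0, 3.0, 11.0)
```

Both `vectSparse1` and `vectSparse2` above describe this same five-component vector.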


**Note**:

Scala has its own vector type, `scala.collection.immutable.Vector`, which is unrelated to Spark's `org.apache.spark.ml.linalg.Vector`.

## Feature Transformers

**VectorAssembler**: combines several columns into a single vector column.

In [4]:
import org.apache.spark.ml.feature.VectorAssembler

val dfm = spark.read.option("header", true).option("inferSchema", "true").csv("../data/iris.txt")
dfm.show(2)

val assembler = new VectorAssembler()
  .setInputCols(Array("sepal_l", "sepal_w", "petal_l", "petal_w"))
  .setOutputCol("features")

val dfmAssembled = assembler.transform(dfm)

dfmAssembled.show(2)

+-------+-------+-------+-------+-----------+
|sepal_l|sepal_w|petal_l|petal_w|     classe|
+-------+-------+-------+-------+-----------+
|    5.1|    3.5|    1.4|    0.2|Iris-setosa|
|    4.9|    3.0|    1.4|    0.2|Iris-setosa|
+-------+-------+-------+-------+-----------+
only showing top 2 rows

+-------+-------+-------+-------+-----------+-----------------+
|sepal_l|sepal_w|petal_l|petal_w|     classe|         features|
+-------+-------+-------+-------+-----------+-----------------+
|    5.1|    3.5|    1.4|    0.2|Iris-setosa|[5.1,3.5,1.4,0.2]|
|    4.9|    3.0|    1.4|    0.2|Iris-setosa|[4.9,3.0,1.4,0.2]|
+-------+-------+-------+-------+-----------+-----------------+
only showing top 2 rows



import org.apache.spark.ml.feature.VectorAssembler
dfm: org.apache.spark.sql.DataFrame = [sepal_l: double, sepal_w: double ... 3 more fields]
assembler: org.apache.spark.ml.feature.VectorAssembler = VectorAssembler: uid=vecAssembler_95d0478000a4, handleInvalid=error, numInputCols=4
dfmAssembled: org.apache.spark.sql.DataFrame = [sepal_l: double, sepal_w: double ... 4 more fields]


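**VectorIndexer**: identifies the categorical variables inside a vector column and indexes them. A feature is treated as categorical when it has at most `maxCategories` distinct values.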

In [5]:
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.feature.VectorAssembler

val dfm = spark.read.option("header", true).option("inferSchema", "true").csv("../data/iris.txt")
dfm.show(2)

val assembler = new VectorAssembler()
  .setInputCols(Array("sepal_l", "sepal_w", "petal_l", "petal_w"))
  .setOutputCol("features")

val dfmAssembled = assembler.transform(dfm)

dfmAssembled.show(2)

val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexed")
  .setMaxCategories(4)

val indexerModel = indexer.fit(dfmAssembled)

val categoricalFeatures: Set[Int] = indexerModel.categoryMaps.keys.toSet
println(s"Chose ${categoricalFeatures.size} " +
  s"categorical features: ${categoricalFeatures.mkString(", ")}")

// Create new column "indexed" with categorical values transformed to indices
val indexedData = indexerModel.transform(dfmAssembled)
indexedData.show(2)

+-------+-------+-------+-------+-----------+
|sepal_l|sepal_w|petal_l|petal_w|     classe|
+-------+-------+-------+-------+-----------+
|    5.1|    3.5|    1.4|    0.2|Iris-setosa|
|    4.9|    3.0|    1.4|    0.2|Iris-setosa|
+-------+-------+-------+-------+-----------+
only showing top 2 rows

+-------+-------+-------+-------+-----------+-----------------+
|sepal_l|sepal_w|petal_l|petal_w|     classe|         features|
+-------+-------+-------+-------+-----------+-----------------+
|    5.1|    3.5|    1.4|    0.2|Iris-setosa|[5.1,3.5,1.4,0.2]|
|    4.9|    3.0|    1.4|    0.2|Iris-setosa|[4.9,3.0,1.4,0.2]|
+-------+-------+-------+-------+-----------+-----------------+
only showing top 2 rows

Chose 0 categorical features: 
+-------+-------+-------+-------+-----------+-----------------+-----------------+
|sepal_l|sepal_w|petal_l|petal_w|     classe|         features|          indexed|
+-------+-------+-------+-------+-----------+-----------------+-----------------+
|    5.1|    

import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.feature.VectorAssembler
dfm: org.apache.spark.sql.DataFrame = [sepal_l: double, sepal_w: double ... 3 more fields]
assembler: org.apache.spark.ml.feature.VectorAssembler = VectorAssembler: uid=vecAssembler_a243cc922027, handleInvalid=error, numInputCols=4
dfmAssembled: org.apache.spark.sql.DataFrame = [sepal_l: double, sepal_w: double ... 4 more fields]
indexer: org.apache.spark.ml.feature.VectorIndexer = vecIdx_70b50df2e535
indexerModel: org.apache.spark.ml.feature.VectorIndexerModel = VectorIndexerModel: uid=vecIdx_70b50df2e535, numFeatures=4, handleInvalid=error
categoricalFeatures: Set[Int] = Set()
indexedData: org.apache.spark.sql.DataFrame = [sepal_l: double, sepal_w: double ... 5 more fields]
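
The message `Chose 0 categorical features` follows from the `maxCategories` rule: with `setMaxCategories(4)`, a feature is kept as continuous as soon as it has more than 4 distinct values, which appears to be the case for all four iris measurements. The plain-Scala sketch below mimics that decision on a small made-up dataset; the helper `categoricalFeatures` and the sample rows are assumptions for illustration, not Spark code.

```scala
// Hypothetical helper (not a Spark API): indices of features with at most
// maxCategories distinct values, i.e. those that would be treated as categorical.
def categoricalFeatures(rows: Seq[Seq[Double]], maxCategories: Int): Set[Int] =
  rows.transpose.zipWithIndex.collect {
    case (column, idx) if column.distinct.size <= maxCategories => idx
  }.toSet

// Made-up data: feature 0 takes 2 distinct values, feature 1 takes 3.
val sampleRows = Seq(Seq(0.0, 1.1), Seq(1.0, 2.2), Seq(0.0, 3.3))
println(categoricalFeatures(sampleRows, maxCategories = 2))  // prints Set(0)
```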


**StringIndexer**: creates a numeric variable from a categorical variable by replacing each category with a numeric index. By default, indices are assigned by descending frequency, so the most frequent category gets index 0.0.

In [6]:
import org.apache.spark.ml.feature.StringIndexer

val sq = Seq((0, "mod1"), (1, "mod2"), (2, "mod3"), (3, "mod1"), (4, "mod1"), (5, "mod2"))

val df = spark.createDataFrame(sq).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

val indexed = indexer.fit(df).transform(df)
indexed.show()

indexed.printSchema

+---+--------+-------------+
| id|category|categoryIndex|
+---+--------+-------------+
|  0|    mod1|          0.0|
|  1|    mod2|          1.0|
|  2|    mod3|          2.0|
|  3|    mod1|          0.0|
|  4|    mod1|          0.0|
|  5|    mod2|          1.0|
+---+--------+-------------+

root
 |-- id: integer (nullable = false)
 |-- category: string (nullable = true)
 |-- categoryIndex: double (nullable = false)



import org.apache.spark.ml.feature.StringIndexer
sq: Seq[(Int, String)] = List((0,mod1), (1,mod2), (2,mod3), (3,mod1), (4,mod1), (5,mod2))
df: org.apache.spark.sql.DataFrame = [id: int, category: string]
indexer: org.apache.spark.ml.feature.StringIndexer = strIdx_af1765017d7b
indexed: org.apache.spark.sql.DataFrame = [id: int, category: string ... 1 more field]
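
The index assignment in the table above can be reproduced with a short plain-Scala sketch of the default frequency-descending ordering. The helper `frequencyIndex` is hypothetical, and breaking ties alphabetically is an assumption of this sketch rather than a statement about every Spark version.

```scala
// Hypothetical helper (not a Spark API): rank labels by descending frequency.
def frequencyIndex(labels: Seq[String]): Map[String, Double] =
  labels.groupBy(identity)
    .map { case (label, occurrences) => (label, occurrences.size) }
    .toSeq
    .sortBy { case (label, count) => (-count, label) } // assumption: ties broken alphabetically
    .map { case (label, _) => label }
    .zipWithIndex
    .map { case (label, idx) => (label, idx.toDouble) }
    .toMap

val categories = Seq("mod1", "mod2", "mod3", "mod1", "mod1", "mod2")
val labelToIndex = frequencyIndex(categories)
println(categories.map(labelToIndex))  // prints List(0.0, 1.0, 2.0, 0.0, 0.0, 1.0)
```

Here `mod1` occurs three times, `mod2` twice and `mod3` once, which gives the indices 0.0, 1.0 and 2.0 seen in the `categoryIndex` column.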


**IndexToString**: performs the inverse operation of StringIndexer, mapping each index back to its original label.

In [7]:
import org.apache.spark.ml.feature.{StringIndexer, IndexToString}

val sq = Seq((0, "mod1"), (1, "mod2"), (2, "mod3"), (3, "mod1"), (4, "mod1"), (5, "mod2"))
val dfm = spark.createDataFrame(sq).toDF("id", "category")

val indexer = new StringIndexer()
                  .setInputCol("category")
                  .setOutputCol("categoryIndex")

val indexed = indexer.fit(dfm).transform(dfm)

val converter = new IndexToString()
                  .setInputCol("categoryIndex")
                  .setOutputCol("originalCategory")
val converted = converter.transform(indexed)

converted.show(false)

+---+--------+-------------+----------------+
|id |category|categoryIndex|originalCategory|
+---+--------+-------------+----------------+
|0  |mod1    |0.0          |mod1            |
|1  |mod2    |1.0          |mod2            |
|2  |mod3    |2.0          |mod3            |
|3  |mod1    |0.0          |mod1            |
|4  |mod1    |0.0          |mod1            |
|5  |mod2    |1.0          |mod2            |
+---+--------+-------------+----------------+



import org.apache.spark.ml.feature.{StringIndexer, IndexToString}
sq: Seq[(Int, String)] = List((0,mod1), (1,mod2), (2,mod3), (3,mod1), (4,mod1), (5,mod2))
dfm: org.apache.spark.sql.DataFrame = [id: int, category: string]
indexer: org.apache.spark.ml.feature.StringIndexer = strIdx_bf07832d19e1
indexed: org.apache.spark.sql.DataFrame = [id: int, category: string ... 1 more field]
converter: org.apache.spark.ml.feature.IndexToString = idxToStr_891d29052332
converted: org.apache.spark.sql.DataFrame = [id: int, category: string ... 2 more fields]


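**OneHotEncoder:** encodes a categorical index as a binary vector with one position per category. With the default `dropLast=true`, the last category is dropped and is represented by the all-zeros vector.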

In [8]:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val sq = Seq((0, "mod1"), (1, "mod2"), (2, "mod3"), (3, "mod1"), (4, "mod1"), (5, "mod2"))

val dfm = spark.createDataFrame(sq).toDF("id", "category")

val indexer = new StringIndexer()
                 .setInputCol("category")
                 .setOutputCol("categoryIndex").fit(dfm)

val indexed = indexer.transform(dfm) 

val encoder = new OneHotEncoder()
                 .setInputCol("categoryIndex")
                 .setOutputCol("categoryVec")
                 .fit(indexed)
val encoded = encoder.transform(indexed)
encoded.show()

+---+--------+-------------+-------------+
| id|category|categoryIndex|  categoryVec|
+---+--------+-------------+-------------+
|  0|    mod1|          0.0|(2,[0],[1.0])|
|  1|    mod2|          1.0|(2,[1],[1.0])|
|  2|    mod3|          2.0|    (2,[],[])|
|  3|    mod1|          0.0|(2,[0],[1.0])|
|  4|    mod1|          0.0|(2,[0],[1.0])|
|  5|    mod2|          1.0|(2,[1],[1.0])|
+---+--------+-------------+-------------+



import org.apache.spark.ml.feature.OneHotEncoder
sq: Seq[(Int, String)] = List((0,mod1), (1,mod2), (2,mod3), (3,mod1), (4,mod1), (5,mod2))
dfm: org.apache.spark.sql.DataFrame = [id: int, category: string]
indexer: org.apache.spark.ml.feature.StringIndexerModel = StringIndexerModel: uid=strIdx_af37b1408f98, handleInvalid=error
indexed: org.apache.spark.sql.DataFrame = [id: int, category: string ... 1 more field]
encoder: org.apache.spark.ml.feature.OneHotEncoderModel = OneHotEncoderModel: uid=oneHotEncoder_3098098b56f7, dropLast=true, handleInvalid=error
encoded: org.apache.spark.sql.DataFrame = [id: int, category: string ... 2 more fields]
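
The `categoryVec` column is printed in sparse notation, so `(2,[0],[1.0])` is the vector (1.0, 0.0) and `(2,[],[])` is (0.0, 0.0). The plain-Scala sketch below reproduces that encoding, including the `dropLast` behaviour; the helper `oneHot` is hypothetical and only illustrates the rule.

```scala
// Hypothetical helper (not a Spark API): one-hot encode a category index.
// With dropLast = true the vector has numCategories - 1 slots and the last
// category is encoded as all zeros.
def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Array[Double] = {
  val size = if (dropLast) numCategories - 1 else numCategories
  val vec = Array.fill(size)(0.0)
  if (index < size) vec(index) = 1.0
  vec
}

println(oneHot(0, 3).mkString("[", ", ", "]"))  // prints [1.0, 0.0]
println(oneHot(2, 3).mkString("[", ", ", "]"))  // prints [0.0, 0.0]
```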


**Normalizer**: rescales each row vector so that it has unit Lp norm (`p = 2` in the example).

In [9]:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.Normalizer

val dfm = spark.read.option("header", true).option("inferSchema", "true").csv("../data/iris.txt")
dfm.show(2)

val assembler = new VectorAssembler()
  .setInputCols(Array("sepal_l", "sepal_w", "petal_l", "petal_w"))
  .setOutputCol("features")

val dfmAssembled = assembler.transform(dfm)

dfmAssembled.show(2)

val normalizer = new Normalizer()
                  .setInputCol("features") 
                  .setOutputCol("normFeatures") 
                  .setP(2.0) 
val dfmNormalized = normalizer.transform(dfmAssembled)
dfmNormalized.show(2)

+-------+-------+-------+-------+-----------+
|sepal_l|sepal_w|petal_l|petal_w|     classe|
+-------+-------+-------+-------+-----------+
|    5.1|    3.5|    1.4|    0.2|Iris-setosa|
|    4.9|    3.0|    1.4|    0.2|Iris-setosa|
+-------+-------+-------+-------+-----------+
only showing top 2 rows

+-------+-------+-------+-------+-----------+-----------------+
|sepal_l|sepal_w|petal_l|petal_w|     classe|         features|
+-------+-------+-------+-------+-----------+-----------------+
|    5.1|    3.5|    1.4|    0.2|Iris-setosa|[5.1,3.5,1.4,0.2]|
|    4.9|    3.0|    1.4|    0.2|Iris-setosa|[4.9,3.0,1.4,0.2]|
+-------+-------+-------+-------+-----------+-----------------+
only showing top 2 rows

+-------+-------+-------+-------+-----------+-----------------+--------------------+
|sepal_l|sepal_w|petal_l|petal_w|     classe|         features|        normFeatures|
+-------+-------+-------+-------+-----------+-----------------+--------------------+
|    5.1|    3.5|    1.4|    0.2|Ir

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.Normalizer
dfm: org.apache.spark.sql.DataFrame = [sepal_l: double, sepal_w: double ... 3 more fields]
assembler: org.apache.spark.ml.feature.VectorAssembler = VectorAssembler: uid=vecAssembler_8de23667a13c, handleInvalid=error, numInputCols=4
dfmAssembled: org.apache.spark.sql.DataFrame = [sepal_l: double, sepal_w: double ... 4 more fields]
normalizer: org.apache.spark.ml.feature.Normalizer = Normalizer: uid=normalizer_caae2a61f658, p=2.0
dfmNormalized: org.apache.spark.sql.DataFrame = [sepal_l: double, sepal_w: double ... 5 more fields]
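
Since the output above is cut off, the following plain-Scala sketch shows what Lp normalization does to a single row: each component is divided by the vector's Lp norm. The helper `normalize` is hypothetical and is not Spark's implementation.

```scala
// Hypothetical helper (not a Spark API): divide a vector by its Lp norm.
def normalize(v: Array[Double], p: Double): Array[Double] = {
  val norm = math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)
  if (norm == 0.0) v else v.map(_ / norm)  // leave the zero vector unchanged
}

// First iris row, normalized with p = 2 as in the cell above.
val unitRow = normalize(Array(5.1, 3.5, 1.4, 0.2), 2.0)

// The L2 norm of the result is 1 up to floating-point rounding.
val l2Norm = math.sqrt(unitRow.map(x => x * x).sum)
println(l2Norm)
```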


**StandardScaler**: standardizes each feature by removing its mean (`withMean`) and/or scaling it to unit standard deviation (`withStd`).

In [10]:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.StandardScaler

val dfm = spark.read.option("header", true).option("inferSchema", "true").csv("../data/iris.txt")
dfm.show(2)

val assembler = new VectorAssembler()
  .setInputCols(Array("sepal_l", "sepal_w", "petal_l", "petal_w"))
  .setOutputCol("features")

val dfmAssembled = assembler.transform(dfm)

dfmAssembled.show(2)

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
  .setWithMean(true)

val scalerModel = scaler.fit(dfmAssembled)

val dfmScaled = scalerModel.transform(dfmAssembled)
dfmScaled.show(2)

+-------+-------+-------+-------+-----------+
|sepal_l|sepal_w|petal_l|petal_w|     classe|
+-------+-------+-------+-------+-----------+
|    5.1|    3.5|    1.4|    0.2|Iris-setosa|
|    4.9|    3.0|    1.4|    0.2|Iris-setosa|
+-------+-------+-------+-------+-----------+
only showing top 2 rows

+-------+-------+-------+-------+-----------+-----------------+
|sepal_l|sepal_w|petal_l|petal_w|     classe|         features|
+-------+-------+-------+-------+-----------+-----------------+
|    5.1|    3.5|    1.4|    0.2|Iris-setosa|[5.1,3.5,1.4,0.2]|
|    4.9|    3.0|    1.4|    0.2|Iris-setosa|[4.9,3.0,1.4,0.2]|
+-------+-------+-------+-------+-----------+-----------------+
only showing top 2 rows

+-------+-------+-------+-------+-----------+-----------------+--------------------+
|sepal_l|sepal_w|petal_l|petal_w|     classe|         features|      scaledFeatures|
+-------+-------+-------+-------+-----------+-----------------+--------------------+
|    5.1|    3.5|    1.4|    0.2|Ir

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.StandardScaler
dfm: org.apache.spark.sql.DataFrame = [sepal_l: double, sepal_w: double ... 3 more fields]
assembler: org.apache.spark.ml.feature.VectorAssembler = VectorAssembler: uid=vecAssembler_4ef06c3c7661, handleInvalid=error, numInputCols=4
dfmAssembled: org.apache.spark.sql.DataFrame = [sepal_l: double, sepal_w: double ... 4 more fields]
scaler: org.apache.spark.ml.feature.StandardScaler = stdScal_25baa67de171
scalerModel: org.apache.spark.ml.feature.StandardScalerModel = StandardScalerModel: uid=stdScal_25baa67de171, numFeatures=4, withMean=true, withStd=true
dfmScaled: org.apache.spark.sql.DataFrame = [sepal_l: double, sepal_w: double ... 5 more fields]
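
With both `withMean` and `withStd` enabled, each feature value `x` becomes `(x - mean) / std`, where the standard deviation is the corrected sample standard deviation (division by `n - 1`). A minimal plain-Scala sketch for one feature column follows; the helper `standardize` and the sample values are illustrative assumptions, not Spark code or iris data.

```scala
// Hypothetical helper (not a Spark API): standardize one feature column.
// Assumes at least two values and a non-zero standard deviation.
def standardize(xs: Seq[Double]): Seq[Double] = {
  val n = xs.size
  val mean = xs.sum / n
  val std = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / (n - 1))
  xs.map(x => (x - mean) / std)
}

// Made-up column: mean = 4.0, sample standard deviation = 2.0.
println(standardize(Seq(2.0, 4.0, 6.0)))  // prints List(-1.0, 0.0, 1.0)
```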


**Tokenizer**: splits a sentence or a line into an array of lowercase words.

In [11]:
import org.apache.spark.ml.feature.Tokenizer

val dfmPhrase= spark.createDataFrame(Seq(
            (0, "Je veux aller au cinema."),
            (1, "Le train est en retard."),
            (2, "La maison est tres belle.")
            )).toDF("id", "phrase")

val tokenizer = new Tokenizer().setInputCol("phrase").setOutputCol("mots")

val tokenized = tokenizer.transform(dfmPhrase)

tokenized.select("mots", "id").take(3).foreach(println)

[WrappedArray(je, veux, aller, au, cinema.),0]
[WrappedArray(le, train, est, en, retard.),1]
[WrappedArray(la, maison, est, tres, belle.),2]


import org.apache.spark.ml.feature.Tokenizer
dfmPhrase: org.apache.spark.sql.DataFrame = [id: int, phrase: string]
tokenizer: org.apache.spark.ml.feature.Tokenizer = tok_1f0a6151e0bd
tokenized: org.apache.spark.sql.DataFrame = [id: int, phrase: string ... 1 more field]
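
The output shows two properties of this tokenizer: words are lowercased, and punctuation stays attached to the word (`cinema.`). The plain-Scala sketch below imitates that behaviour by lowercasing and splitting on whitespace; the helper `simpleTokenize` is hypothetical and only approximates what the transformer does.

```scala
// Hypothetical helper (not a Spark API): lowercase, then split on whitespace.
def simpleTokenize(sentence: String): Seq[String] =
  sentence.toLowerCase.split("\\s").toSeq

println(simpleTokenize("Je veux aller au cinema."))
// prints List(je, veux, aller, au, cinema.) on Scala 2.12
```

When punctuation should be stripped, Spark's `RegexTokenizer` lets the splitting pattern be chosen instead.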


**Binarizer**: turns a numeric variable into a binary variable by comparing it to a threshold.

In [12]:
import org.apache.spark.ml.feature.Binarizer

val array = Array((0, 0.1), (1,  0.3), (2, 0.9), (3, 0.8),
                  (0, 0.0), (0, 0.6))
val dfm = spark.createDataFrame(array).toDF("id", "feature")

val binarizer = new Binarizer()
            .setInputCol("feature")
            .setOutputCol("label")
            .setThreshold(0.5)

val dfmBinarized = binarizer.transform(dfm)
dfmBinarized.show()

+---+-------+-----+
| id|feature|label|
+---+-------+-----+
|  0|    0.1|  0.0|
|  1|    0.3|  0.0|
|  2|    0.9|  1.0|
|  3|    0.8|  1.0|
|  0|    0.0|  0.0|
|  0|    0.6|  1.0|
+---+-------+-----+



import org.apache.spark.ml.feature.Binarizer
array: Array[(Int, Double)] = Array((0,0.1), (1,0.3), (2,0.9), (3,0.8), (0,0.0), (0,0.6))
dfm: org.apache.spark.sql.DataFrame = [id: int, feature: double]
binarizer: org.apache.spark.ml.feature.Binarizer = Binarizer: uid=binarizer_57d2f08f5512
dfmBinarized: org.apache.spark.sql.DataFrame = [id: int, feature: double ... 1 more field]
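
The rule applied here is simple: values strictly greater than the threshold map to 1.0, and values less than or equal to it map to 0.0. A plain-Scala sketch, with the hypothetical helper `binarize`, reproduces the `label` column from the `feature` column above.

```scala
// Hypothetical helper (not a Spark API): threshold a single value.
def binarize(x: Double, threshold: Double): Double =
  if (x > threshold) 1.0 else 0.0

val featureValues = Seq(0.1, 0.3, 0.9, 0.8, 0.0, 0.6)
println(featureValues.map(binarize(_, 0.5)))  // prints List(0.0, 0.0, 1.0, 1.0, 0.0, 1.0)
```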


**Bucketizer**:

Bucketizer turns a numeric variable into a categorical variable made of buckets. The user sets the bucket boundaries with the `splits` parameter: `n + 1` split points define `n` buckets.

For example, an `age` variable can be split into age brackets.

In [13]:
import org.apache.spark.ml.feature.Bucketizer

val splits = Array(0, 2, 4, 6, 8, Double.PositiveInfinity)

val array = Array((0, 7), (1,  5), (2, 3), (3, 8),
                  (4, 4), (5, 6), (6, 1), (7, 13))
val dfm = spark.createDataFrame(array).toDF("id", "age")

dfm.show(2)

val bucketizer = new Bucketizer()
  .setInputCol("age")
  .setOutputCol("bucketedAge")
  .setSplits(splits)

// Transform original data into its bucket index.
val bucketedData = bucketizer.transform(dfm)

println(s"Bucketizer output with ${bucketizer.getSplits.length-1} buckets")
bucketedData.show()

+---+---+
| id|age|
+---+---+
|  0|  7|
|  1|  5|
+---+---+
only showing top 2 rows

Bucketizer output with 5 buckets
+---+---+-----------+
| id|age|bucketedAge|
+---+---+-----------+
|  0|  7|        3.0|
|  1|  5|        2.0|
|  2|  3|        1.0|
|  3|  8|        4.0|
|  4|  4|        2.0|
|  5|  6|        3.0|
|  6|  1|        0.0|
|  7| 13|        4.0|
+---+---+-----------+



import org.apache.spark.ml.feature.Bucketizer
splits: Array[Double] = Array(0.0, 2.0, 4.0, 6.0, 8.0, Infinity)
array: Array[(Int, Int)] = Array((0,7), (1,5), (2,3), (3,8), (4,4), (5,6), (6,1), (7,13))
dfm: org.apache.spark.sql.DataFrame = [id: int, age: int]
bucketizer: org.apache.spark.ml.feature.Bucketizer = Bucketizer: uid=bucketizer_a0bc3b24c907
bucketedData: org.apache.spark.sql.DataFrame = [id: int, age: int ... 1 more field]
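
Each bucket covers the half-open interval `[x, y)` between two consecutive split points, except the last bucket, which also includes its upper bound. That is why age 8 lands in bucket 4.0 and not 3.0. The plain-Scala sketch below reproduces the `bucketedAge` column; the helper `bucketOf` is hypothetical and is not Spark's implementation.

```scala
// Hypothetical helper (not a Spark API): index of the bucket containing x.
def bucketOf(x: Double, splits: Array[Double]): Int = {
  require(x >= splits.head && x <= splits.last, s"$x is outside the splits range")
  if (x == splits.last) splits.length - 2 // the last bucket also includes its upper bound
  else splits.lastIndexWhere(_ <= x)      // bucket i covers [splits(i), splits(i + 1))
}

val ageSplits = Array(0.0, 2.0, 4.0, 6.0, 8.0, Double.PositiveInfinity)
val ages = Seq(7, 5, 3, 8, 4, 6, 1, 13)
println(ages.map(age => bucketOf(age.toDouble, ageSplits)))
// prints List(3, 2, 1, 4, 2, 3, 0, 4)
```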


## Feature Extractors

**IDF**: computes the Inverse Document Frequency (IDF) for a given collection of documents and uses it to rescale term-frequency vectors.

In [14]:
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.feature.IDF
import org.apache.spark.ml.feature.Tokenizer

val dfmPhrase = spark.createDataFrame(Seq(
            (0.0, "Je veux aller au cinema."),
            (1.0, "Le train est en retard."),
            (0.0, "La maison est tres belle.")
            )).toDF("label", "sentence")

val tokenizer = new Tokenizer()
                  .setInputCol("sentence")
                  .setOutputCol("words")
val wordsData = tokenizer.transform(dfmPhrase)

val hashingTF = new HashingTF().setInputCol("words")
                   .setOutputCol("rawFeatures")
                   .setNumFeatures(10)

val featurizedData = hashingTF.transform(wordsData)

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)

rescaledData.select("label", "features").show(false)

+-----+-------------------------------------------------------------------------------------+
|label|features                                                                             |
+-----+-------------------------------------------------------------------------------------+
|0.0  |(10,[0,2,4,6,8],[0.0,0.6931471805599453,0.0,0.28768207245178085,0.28768207245178085])|
|1.0  |(10,[0,3,4,7],[0.0,0.6931471805599453,0.0,0.6931471805599453])                       |
|0.0  |(10,[0,1,4,6,8],[0.0,0.6931471805599453,0.0,0.28768207245178085,0.28768207245178085])|
+-----+-------------------------------------------------------------------------------------+



import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.feature.IDF
import org.apache.spark.ml.feature.Tokenizer
dfmPhrase: org.apache.spark.sql.DataFrame = [label: double, sentence: string]
tokenizer: org.apache.spark.ml.feature.Tokenizer = tok_da8893aebbce
wordsData: org.apache.spark.sql.DataFrame = [label: double, sentence: string ... 1 more field]
hashingTF: org.apache.spark.ml.feature.HashingTF = HashingTF: uid=hashingTF_8e7004cd51eb, binary=false, numFeatures=10
featurizedData: org.apache.spark.sql.DataFrame = [label: double, sentence: string ... 2 more fields]
idf: org.apache.spark.ml.feature.IDF = idf_55542dac47ee
idfModel: org.apache.spark.ml.feature.IDFModel = IDFModel: uid=idf_55542dac47ee, numDocs=3, numFeatures=10
rescaledData: org.apache.spark.sql...
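
The three distinct values in the `features` column are consistent with the smoothed formula `idf = log((m + 1) / (df + 1))`, where `m` is the number of documents (3 here) and `df` is the number of documents containing the hashed term. The plain-Scala sketch below evaluates that formula; the helper name `idfWeight` is hypothetical.

```scala
// Hypothetical helper (not a Spark API): smoothed inverse document frequency.
def idfWeight(numDocs: Long, docFreq: Long): Double =
  math.log((numDocs + 1.0) / (docFreq + 1.0))

println(idfWeight(3, 1))  // term bucket present in 1 of 3 documents: log(4/2), about 0.693
println(idfWeight(3, 2))  // term bucket present in 2 of 3 documents: log(4/3), about 0.288
println(idfWeight(3, 3))  // term bucket present in every document: log(4/4) = 0.0
```

A bucket that appears in every document therefore gets weight 0.0, which explains the zeros at indices 0 and 4.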


**Word2Vec**: learns a fixed-size numeric vector for each word; the fitted model then represents a document as the average of the vectors of its words.

In [15]:
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

val phraseDF = spark.createDataFrame(Seq(
  "Le train est en retard",
  "Le train entre en gare",
  "Les voyageurs descendent du train"
).map(x => x.split(" ")).map(Tuple1.apply)).toDF("phrase")

phraseDF.show()

// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
  .setInputCol("phrase")
  .setOutputCol("features")
  .setVectorSize(3)
  .setMinCount(0)

val model = word2Vec.fit(phraseDF)

val output = model.transform(phraseDF)

output.collect().foreach { case Row(text: Seq[_], features: Vector) =>
  println(s"Texte: [${text.mkString(", ")}] => \nVecteur: $features\n") }

+--------------------+
|              phrase|
+--------------------+
|[Le, train, est, ...|
|[Le, train, entre...|
|[Les, voyageurs, ...|
+--------------------+

Texte: [Le, train, est, en, retard] => 
Vecteur: [-0.06348072681576014,0.00417338686529547,0.03518190383911133]

Texte: [Le, train, entre, en, gare] => 
Vecteur: [-0.03118861708790064,0.03171334862709046,0.029982009530067445]

Texte: [Les, voyageurs, descendent, du, train] => 
Vecteur: [-0.03028365280479193,-0.02631254754960537,0.01358244437724352]



import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
phraseDF: org.apache.spark.sql.DataFrame = [phrase: array<string>]
word2Vec: org.apache.spark.ml.feature.Word2Vec = w2v_cd91e925ea31
model: org.apache.spark.ml.feature.Word2VecModel = Word2VecModel: uid=w2v_cd91e925ea31, numWords=11, vectorSize=3
output: org.apache.spark.sql.DataFrame = [phrase: array<string>, features: vector]
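
The document vectors printed above are obtained by averaging the vectors of the words in each sentence. The plain-Scala sketch below shows that averaging step on made-up two-dimensional word vectors; the helper `averageVectors` and the numbers are assumptions for illustration and are not taken from the fitted model.

```scala
// Hypothetical helper (not a Spark API): component-wise mean of word vectors.
def averageVectors(wordVectors: Seq[Seq[Double]]): Seq[Double] =
  wordVectors.transpose.map(component => component.sum / wordVectors.size)

// Made-up word vectors for a two-word document.
val docVector = averageVectors(Seq(Seq(1.0, 2.0), Seq(3.0, 4.0)))
println(docVector)  // prints List(2.0, 3.0)
```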


**CountVectorizer**: builds a vocabulary from the corpus and counts the number of occurrences of each vocabulary word in every document.

In [16]:
import org.apache.spark.ml.feature.CountVectorizer

val phraseDF = spark.createDataFrame(Seq(
  "Le train est en retard",
  "Le train entre en gare",
  "Les voyageurs descendent du train"
).map(x => x.split(" ")).map(Tuple1.apply)).toDF("words")

phraseDF.show()

val countVectorizer = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(5)
  .setMinDF(0)

val model = countVectorizer.fit(phraseDF)

model.transform(phraseDF).show(false)
 
println("Vocabulaire = " + model.vocabulary.mkString(", "))

+--------------------+
|               words|
+--------------------+
|[Le, train, est, ...|
|[Le, train, entre...|
|[Les, voyageurs, ...|
+--------------------+

+---------------------------------------+-------------------------+
|words                                  |features                 |
+---------------------------------------+-------------------------+
|[Le, train, est, en, retard]           |(5,[0,1,2],[1.0,1.0,1.0])|
|[Le, train, entre, en, gare]           |(5,[0,1,2],[1.0,1.0,1.0])|
|[Les, voyageurs, descendent, du, train]|(5,[0,3,4],[1.0,1.0,1.0])|
+---------------------------------------+-------------------------+

Vocabulaire = train, en, Le, Les, du


import org.apache.spark.ml.feature.CountVectorizer
phraseDF: org.apache.spark.sql.DataFrame = [words: array<string>]
countVectorizer: org.apache.spark.ml.feature.CountVectorizer = cntVec_f213f9a63710
model: org.apache.spark.ml.feature.CountVectorizerModel = CountVectorizerModel: uid=cntVec_f213f9a63710, vocabularySize=5
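
Once the vocabulary is fixed, each document becomes a vector of counts indexed by vocabulary position; words outside the vocabulary (such as `est` or `retard` here) are ignored. The plain-Scala sketch below redoes that counting with the vocabulary printed above; the helper `countVector` is hypothetical and returns a dense sequence, whereas Spark prints the result in sparse form.

```scala
// Hypothetical helper (not a Spark API): count occurrences of each vocabulary term.
def countVector(words: Seq[String], vocabulary: Seq[String]): Seq[Double] =
  vocabulary.map(term => words.count(_ == term).toDouble)

val vocabulary = Seq("train", "en", "Le", "Les", "du")
println(countVector(Seq("Le", "train", "est", "en", "retard"), vocabulary))
// prints List(1.0, 1.0, 1.0, 0.0, 0.0), i.e. (5,[0,1,2],[1.0,1.0,1.0])
```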


## Feature Selectors

**VectorSlicer**: selects a subset of the features of a vector column.

It takes a list of feature indices and/or feature names and produces a new vector column containing only the selected features:

* Feature indices with `setIndices`
* Feature names with `setNames`

In [17]:
import java.util.Arrays

import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Row}
import org.apache.spark.sql.types.StructType

val data = Arrays.asList(
                        Row(Vectors.dense(1.5, 3.9, 4.2)),
                        Row(Vectors.dense(-3.0, 1.3, 2.5))
                        )

val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("col1", "col2", "col3").map(defaultAttr.withName)

val attrGroup = new AttributeGroup("features", attrs.asInstanceOf[Array[Attribute]])

val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))

val slicer = new VectorSlicer().setInputCol("features").setOutputCol("SelectedFeatures")

slicer.setIndices(Array(1)).setNames(Array("col3"))
// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("col2", "col3"))

val output = slicer.transform(dataset)
output.show(false)

+--------------+----------------+
|features      |SelectedFeatures|
+--------------+----------------+
|[1.5,3.9,4.2] |[3.9,4.2]       |
|[-3.0,1.3,2.5]|[1.3,2.5]       |
+--------------+----------------+



import java.util.Arrays
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
data: java.util.List[org.apache.spark.sql.Row] = [[[1.5,3.9,4.2]], [[-3.0,1.3,2.5]]]
defaultAttr: org.apache.spark.ml.attribute.NumericAttribute = {"type":"numeric"}
attrs: Array[org.apache.spark.ml.attribute.NumericAttribute] = Array({"type":"numeric","name":"col1"}, {"type":"numeric","name":"col2"}, {"type":"numeric","name":"col3"})
attrGroup: org.apache.spark.ml.attribute.AttributeGroup = {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"col1"},{"idx":1,"name":"col2"},{"idx":2,"name":"col3"}]},"num_attrs":3...


**ChiSqSelector:** selects categorical features for predicting a categorical label, based on the chi-squared test of independence.

In [18]:
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  (0, Vectors.dense(0.0, 5.0, 12.0, 1.0), 1.0),
  (1, Vectors.dense(5.0, 1.0, 1.0, 30.0), 0.0),
  (2, Vectors.dense(1.0, 6.0, 15.0, 1.0), 0.0),
  (3, Vectors.dense(8.0, 3.0, 10.0, 7.0), 1.0),
  (4, Vectors.dense(3.0, 1.0, 2.0, 11.0), 0.0),
  (5, Vectors.dense(1.0, 34.0, 5.0, 1.0), 1.0)
)

val dfm = spark.createDataFrame(data).toDF("id", "features", "label")

val selector = new ChiSqSelector()
  .setNumTopFeatures(2)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val result = selector.fit(dfm).transform(dfm)

println(s"ChiSqSelector output with top ${selector.getNumTopFeatures} features selected")
result.show()

ChiSqSelector output with top 2 features selected
+---+------------------+-----+----------------+
| id|          features|label|selectedFeatures|
+---+------------------+-----+----------------+
|  0|[0.0,5.0,12.0,1.0]|  1.0|      [5.0,12.0]|
|  1|[5.0,1.0,1.0,30.0]|  0.0|       [1.0,1.0]|
|  2|[1.0,6.0,15.0,1.0]|  0.0|      [6.0,15.0]|
|  3|[8.0,3.0,10.0,7.0]|  1.0|      [3.0,10.0]|
|  4|[3.0,1.0,2.0,11.0]|  0.0|       [1.0,2.0]|
|  5|[1.0,34.0,5.0,1.0]|  1.0|      [34.0,5.0]|
+---+------------------+-----+----------------+



import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
data: Seq[(Int, org.apache.spark.ml.linalg.Vector, Double)] = List((0,[0.0,5.0,12.0,1.0],1.0), (1,[5.0,1.0,1.0,30.0],0.0), (2,[1.0,6.0,15.0,1.0],0.0), (3,[8.0,3.0,10.0,7.0],1.0), (4,[3.0,1.0,2.0,11.0],0.0), (5,[1.0,34.0,5.0,1.0],1.0))
dfm: org.apache.spark.sql.DataFrame = [id: int, features: vector ... 1 more field]
selector: org.apache.spark.ml.feature.ChiSqSelector = chiSqSelector_b767b642980c
result: org.apache.spark.sql.DataFrame = [id: int, features: vector ... 2 more fields]
