# One-hot encoding
In the one-hot encoding approach, a new dummy feature is created for each
unique value in a nominal feature column. Each dummy feature is 1 when the
sample has that value and 0 otherwise.
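
As a rough illustration, one-hot encoding can be sketched in plain Scala without Spark. The `colors` column and its values are hypothetical and exist only for this example; the dummy features are ordered alphabetically, which is an arbitrary choice.

```scala
// Hypothetical nominal column; the values are made up for illustration.
val colors = Seq("red", "green", "blue", "green")

// One dummy feature per unique value, in a fixed (sorted) order.
val categories = colors.distinct.sorted

// Each row gets 1.0 in the position of its own category and 0.0 elsewhere.
val oneHot = colors.map(c => categories.map(cat => if (cat == c) 1.0 else 0.0))

oneHot.foreach(row => println(row.mkString("[", ", ", "]")))
```

For DataFrames, Spark ML provides `StringIndexer` and `OneHotEncoder` in `org.apache.spark.ml.feature`, which would normally be used in place of a hand-written mapping like this one.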

# Bag-of-words
The idea behind the bag-of-words model can be summarized as follows:
1. We create a vocabulary of unique words from the entire set of documents.
2. We construct a feature vector for each document that contains the counts of
how often each vocabulary word occurs in that document.
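
The two steps can be sketched in plain Scala before turning to Spark. This is a minimal illustration using the three example sentences from this notebook; it orders the vocabulary by first appearance, so the positions will not match the vocabulary that Spark's `CountVectorizer` produces.

```scala
val docs = Seq(
  "The sun is shining",
  "The weather is sweet, sweet",
  "The sun is shining and the weather is sweet")

// Lowercase and split on whitespace (punctuation stays attached to words).
val tokenized = docs.map(_.toLowerCase.split("\\s+").toSeq)

// Step 1: vocabulary of unique words across all documents.
val vocabulary = tokenized.flatten.distinct

// Step 2: one count vector per document, aligned with the vocabulary order.
val vectors = tokenized.map { words =>
  val counts = words.groupBy(identity).map { case (w, ws) => w -> ws.size }
  vocabulary.map(w => counts.getOrElse(w, 0))
}

println(vocabulary.mkString(" | "))
vectors.foreach(v => println(v.mkString("[", ", ", "]")))
```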

In [1]:
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer}
import org.apache.spark.ml.linalg.Vector

## Create a set of documents

In [2]:
val df = List( (0,"The sun is shining"),
                (1,"The weather is sweet, sweet"),
                (2,"The sun is shining and the weather is sweet")).toDF("id","doc")

df = [id: int, doc: string]

In [3]:
df.show()

+---+--------------------+
| id|                 doc|
+---+--------------------+
|  0|  The sun is shining|
|  1|The weather is sw...|
|  2|The sun is shinin...|
+---+--------------------+



## Tokenize the documents

In [4]:
val tokenizer = new Tokenizer().
  setInputCol("doc").
  setOutputCol("words")

tokenizer = tok_dd304cc5f270

In [5]:
val df_t = tokenizer.transform(df)
df_t.show()

+---+--------------------+--------------------+
| id|                 doc|               words|
+---+--------------------+--------------------+
|  0|  The sun is shining|[the, sun, is, sh...|
|  1|The weather is sw...|[the, weather, is...|
|  2|The sun is shinin...|[the, sun, is, sh...|
+---+--------------------+--------------------+



df_t = [id: int, doc: string ... 1 more field]
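
The `words` column is truncated in the table, so it is easy to miss that `Tokenizer` lowercases the text and splits it on whitespace only. The sketch below imitates that behavior in plain Scala (it is an approximation, not Spark's implementation) and contrasts it with splitting on non-word characters, which is one possible way to drop punctuation.

```scala
val doc = "The weather is sweet, sweet"

// Lowercase, then split on whitespace: the comma stays attached to "sweet,".
val whitespaceTokens = doc.toLowerCase.split("\\s+").toSeq

// Alternative: split on runs of non-word characters, which drops punctuation.
val wordTokens = doc.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

println(whitespaceTokens.mkString(" | "))
println(wordTokens.mkString(" | "))
```

With whitespace splitting, `sweet,` and `sweet` are distinct tokens and end up as separate vocabulary entries. If that is not wanted, Spark ML's `RegexTokenizer`, whose split pattern is configurable, is a likely replacement for `Tokenizer`.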

## Vectorize the documents

In [6]:
val cv = new CountVectorizer().
  setInputCol("words").
  setOutputCol("features").
  fit(df_t)



cv = cntVec_94b4e0a2fe93

In [7]:
val df_v = cv.transform(df_t)

df_v = [id: int, doc: string ... 2 more fields]

In [8]:
df_v.show()

+---+--------------------+--------------------+--------------------+
| id|                 doc|               words|            features|
+---+--------------------+--------------------+--------------------+
|  0|  The sun is shining|[the, sun, is, sh...|(8,[0,1,2,5],[1.0...|
|  1|The weather is sw...|[the, weather, is...|(8,[0,1,3,4,7],[1...|
|  2|The sun is shinin...|[the, sun, is, sh...|(8,[0,1,2,3,4,5,6...|
+---+--------------------+--------------------+--------------------+
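
The `features` column is printed in sparse notation: the vector size, the indices of the non-zero entries, and their values. The plain-Scala sketch below shows how such a triple expands into a dense vector. The helper `toDense` is hypothetical and written only for this example; the values for document 0 are truncated in the table and are taken from the dense output shown later in this notebook.

```scala
// Expand a sparse (size, indices, values) triple into a dense array.
def toDense(size: Int, indices: Seq[Int], values: Seq[Double]): Array[Double] = {
  val dense = Array.fill(size)(0.0)
  indices.zip(values).foreach { case (i, v) => dense(i) = v }
  dense
}

// Document 0: positions 0, 1, 2 and 5 each hold a count of 1.0.
val doc0 = toDense(8, Seq(0, 1, 2, 5), Seq(1.0, 1.0, 1.0, 1.0))
println(doc0.mkString("[", ",", "]"))
```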



In [9]:
df_v.select("features").collect.map(row => row(0).asInstanceOf[Vector].toDense)

[[1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0], [1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0], [2.0,2.0,1.0,1.0,1.0,1.0,1.0,0.0]]

## Vocabulary

In [10]:
cv.vocabulary

[the, is, sun, weather, sweet, shining, and, sweet,]
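
Each position in a feature vector corresponds to the vocabulary entry at the same index. As a small sketch, the vocabulary and the dense vector of document 1 are copied here by hand from the outputs in this notebook and paired up in plain Scala. Note that the last vocabulary entry is the token `sweet,` with its trailing comma, which is counted separately from `sweet`.

```scala
// Vocabulary as printed by the notebook; the last entry is the token "sweet,".
val vocabulary = Seq("the", "is", "sun", "weather", "sweet", "shining", "and", "sweet,")

// Dense count vector of document 1 ("The weather is sweet, sweet").
val doc1 = Seq(1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0)

// Keep only the terms that occur in the document, with their counts.
val termCounts = vocabulary.zip(doc1).collect {
  case (term, n) if n > 0 => term -> n.toInt
}

termCounts.foreach { case (term, n) => println(s"$term: $n") }
```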