## Word2Vec

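- Word2Vec is a method for converting words into vectors.

- __The Bag of Words (BoW) model ignores word order and context.__

- To overcome this limitation, Word2Vec __uses a neural network to learn word embeddings that capture the context of, and associations between, words__.

- Once words are converted into vectors, the distance between them can be measured, which makes arithmetic on words possible.

- vectorSize sets the number of dimensions of each word vector, and minCount sets the minimum frequency a word must have to be included.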
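
The distance-based comparison of word vectors mentioned above can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: the words and the 3-dimensional vectors are made-up values, and `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical word vectors (made-up values, for illustration only).
vectors = {
    "king": [0.9, 0.1, 0.3],
    "queen": [0.8, 0.2, 0.4],
    "apple": [-0.2, 0.9, -0.5],
}

sim_close = cosine_similarity(vectors["king"], vectors["queen"])
sim_far = cosine_similarity(vectors["king"], vectors["apple"])

# Vectors pointing in similar directions score closer to 1.0.
print(sim_close > sim_far)
```

In PySpark, a fitted Word2Vec model exposes `findSynonyms(word, num)`, which ranks words by a similarity score of this kind.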

In [1]:
import pyspark

spark = pyspark.sql.SparkSession\
    .builder\
    .master("local")\
    .appName("w2v")\
    .config(conf=pyspark.SparkConf())\
    .getOrCreate()



In [2]:
doc2d=[
    ["When I find myself in times of trouble"],
    ["Mother Mary comes to me"],
    ["Speaking words of wisdom, let it be"],
    ["And in my hour of darkness"],
    ["She is standing right in front of me"],
    ["Speaking words of wisdom, let it be"],
    [u"우리 Let it be"],
    [u"나 Let it be"],
    [u"너 Let it be"],
    ["Let it be"],
    ["Whisper words of wisdom, let it be"]
]

In [3]:
myDf = spark.createDataFrame(doc2d, ["sent"])

In [4]:
from pyspark.ml.feature import Tokenizer

tokenizer = Tokenizer(inputCol="sent", outputCol="words")
tokDf = tokenizer.transform(myDf)

In [5]:
from pyspark.ml.feature import Word2Vec

word2vec = Word2Vec(vectorSize=3, minCount=0, inputCol="words", outputCol="w2v")
model = word2vec.fit(tokDf)
w2vDf = model.transform(tokDf)



In [7]:
for e in w2vDf.select("w2v").take(3):
    print(e)

Row(w2v=DenseVector([0.0199, -0.0069, -0.0002]))
Row(w2v=DenseVector([0.0866, -0.0532, -0.0113]))
Row(w2v=DenseVector([0.0298, 0.0029, 0.0186]))


In [9]:
model.getVectors().show(truncate=False)

+--------+------------------------------------------------------------------+
|word    |vector                                                            |
+--------+------------------------------------------------------------------+
|trouble |[0.08160634338855743,-0.043025512248277664,0.10656099021434784]   |
|mother  |[0.14179208874702454,-0.015200603753328323,0.1217670664191246]    |
|find    |[-0.01224907673895359,0.026784956455230713,0.10260768979787827]   |
|standing|[-0.041652701795101166,0.037779148668050766,0.0017204727046191692]|
|wisdom, |[0.10510547459125519,-0.09538917988538742,-0.02657231315970421]   |
|in      |[-0.008711651898920536,0.1659986525774002,-0.11379893869161606]   |
|myself  |[0.15553179383277893,-0.04055459052324295,-0.01095652673393488]   |
|is      |[-0.045858271420001984,0.08375487476587296,-0.10092845559120178]  |
|darkness|[0.13663016259670258,0.024674516171216965,-0.006880973465740681]  |
|우리    |[-0.15229535102844238,0.12856164574623108,0.125185593962

## NGram

- For text, an n-gram is a sequence of n consecutive tokens.

- A unigram consists of one word, and a bigram of two words.
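
The definition above can be sketched in a few lines of plain Python. The function name `ngrams` is a hypothetical helper used only for illustration; it mimics the idea of sliding a window of n tokens, not Spark's internals.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["mother", "mary", "comes", "to", "me"]

unigrams = ngrams(tokens, 1)  # one token per item
bigrams = ngrams(tokens, 2)   # two consecutive tokens per item

print(bigrams)
```

A sequence of k tokens yields k - n + 1 n-grams, so an input shorter than n produces an empty list.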

In [10]:
from pyspark.ml.feature import NGram

ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramDf = ngram.transform(tokDf)

In [11]:
ngramDf.show(truncate=False)

+--------------------------------------+-----------------------------------------------+--------------------------------------------------------------------------+
|sent                                  |words                                          |ngrams                                                                    |
+--------------------------------------+-----------------------------------------------+--------------------------------------------------------------------------+
|When I find myself in times of trouble|[when, i, find, myself, in, times, of, trouble]|[when i, i find, find myself, myself in, in times, times of, of trouble]  |
|Mother Mary comes to me               |[mother, mary, comes, to, me]                  |[mother mary, mary comes, comes to, to me]                                |
|Speaking words of wisdom, let it be   |[speaking, words, of, wisdom,, let, it, be]    |[speaking words, words of, of wisdom,, wisdom, let, let it, it be]        |
|And in my hour 

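## StringIndexer

- Converts a string column into an index column.

- Indices are assigned starting from 0.0 __in descending order of frequency__.

- The index is of type double. By default ("error"), a label that was not seen during fitting raises an exception. This behavior can be changed with a call such as setHandleInvalid("skip"); the accepted values are "skip", "keep", and "error".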


| Type | Description | Example |
|----|------|-------------|
| nominal | Category values that only name or distinguish | lion, tiger, human |
| ordinal | Unlike nominal values, these have an order. | height: low, mid, high |
| interval | Values are evenly spaced. | 150-165, 165-180, 180-195 |
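
The frequency-based indexing and the three handleInvalid options can be sketched in plain Python. This is a simplified illustration, not Spark's code: `fit_string_indexer` and `index_labels` are hypothetical names, and breaking ties alphabetically is an assumption of this sketch, chosen because it is consistent with the sentLabel output in this section.

```python
from collections import Counter


def fit_string_indexer(labels):
    """Map each label to a double index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def index_labels(mapping, labels, handle_invalid="error"):
    """Apply a fitted mapping; unseen labels follow handle_invalid."""
    indexed = []
    for label in labels:
        if label in mapping:
            indexed.append(mapping[label])
        elif handle_invalid == "skip":
            continue  # drop the row with the unseen label
        elif handle_invalid == "keep":
            indexed.append(float(len(mapping)))  # one extra bucket for unseen labels
        else:
            raise ValueError(f"Unseen label: {label}")
    return indexed


mapping = fit_string_indexer(["b", "a", "b", "c"])
print(mapping)
print(index_labels(mapping, ["a", "z"], handle_invalid="keep"))
```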

In [12]:
from pyspark.ml.feature import StringIndexer

labelIndexer = StringIndexer(inputCol="sent", outputCol="sentLabel")

In [13]:
model = labelIndexer.fit(myDf)
siDf = model.transform(myDf)
siDf.show(truncate=False)

+--------------------------------------+---------+
|sent                                  |sentLabel|
+--------------------------------------+---------+
|When I find myself in times of trouble|5.0      |
|Mother Mary comes to me               |3.0      |
|Speaking words of wisdom, let it be   |0.0      |
|And in my hour of darkness            |1.0      |
|She is standing right in front of me  |4.0      |
|Speaking words of wisdom, let it be   |0.0      |
|우리 Let it be                        |9.0      |
|나 Let it be                          |7.0      |
|너 Let it be                          |8.0      |
|Let it be                             |2.0      |
|Whisper words of wisdom, let it be    |6.0      |
+--------------------------------------+---------+



## One-hot Encoding

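- The indices produced by StringIndexer above can look ordered, as in 0 < 1 < 2 ..., but they actually carry no order.

- One-hot encoding converts the index of a nominal variable into a binary vector, so that the values have no order relative to each other.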

| Nominal value | Binary vector | Sparse Vector |
| ----- | ------- | ------------- |
| A     | 10      | (2, [0], [1.0]) |
| B     | 01      | (2, [1], [1.0]) |
| C     | 00      | (2, [], [])   |
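
The table above can be reproduced with a small plain-Python sketch. The last category (C) maps to an all-zero vector because PySpark's OneHotEncoder drops the last category by default (its `dropLast` parameter). The helper name `one_hot` and its `drop_last` argument are hypothetical and only illustrate the idea with dense lists instead of sparse vectors.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index."""
    # With drop_last, the final category is encoded as all zeros,
    # so the vector needs one element fewer than the number of categories.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


# A -> 0, B -> 1, C -> 2, as in the table above.
for name, idx in [("A", 0), ("B", 1), ("C", 2)]:
    print(name, one_hot(idx, 3))
```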

In [14]:
df = spark.createDataFrame([
    (1, "B"),
    (2, "C"),
    (3, "A"),
    (4, "B"),
    (5, "C"),
    (6, "A")
], ["id", "grade"])

In [17]:
from pyspark.ml.feature import StringIndexer

stringIndexer = StringIndexer(inputCol="grade", outputCol="gradeIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

In [18]:
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCol="gradeIndex", outputCol="gradeVec")
encoded = encoder.fit(indexed)

In [19]:
encoded.transform(indexed).show()

+---+-----+----------+-------------+
| id|grade|gradeIndex|     gradeVec|
+---+-----+----------+-------------+
|  1|    B|       1.0|(2,[1],[1.0])|
|  2|    C|       2.0|    (2,[],[])|
|  3|    A|       0.0|(2,[0],[1.0])|
|  4|    B|       1.0|(2,[1],[1.0])|
|  5|    C|       2.0|    (2,[],[])|
|  6|    A|       0.0|(2,[0],[1.0])|
+---+-----+----------+-------------+



## Transforming Continuous Data

In [21]:
!cat data/ds_spark_heightweights.txt

1	65.78	112.99
2	71.52	136.49
3	69.40	153.03
4	68.22	142.34
5	67.79	144.30
6	68.70	123.30
7	69.80	141.49
8	70.01	136.46
9	67.90	112.37
10	66.78	120.67
11	66.49	127.45
12	67.62	114.14
13	68.30	125.61
14	67.12	122.46
15	68.28	116.09
16	71.09	140.00
17	66.46	129.50
18	68.65	142.97
19	71.23	137.90
20	67.13	124.04
21	67.83	141.28
22	68.88	143.54
23	63.48	97.90
24	68.42	129.50
25	67.63	141.85
26	67.21	129.72
27	70.84	142.42
28	67.49	131.55
29	66.53	108.33
30	65.44	113.89
31	69.52	103.30
32	65.81	120.75
33	67.82	125.79
34	70.60	136.22
35	71.80	140.10
36	69.21	128.75
37	66.80	141.80
38	67.66	121.23
39	67.81	131.35
40	64.05	106.71
41	68.57	124.36
42	65.18	124.86
43	69.66	139.67
44	67.97	137.37
45	65.98	106.45
46	68.67	128.76
47	66.88	145.68
48	67.70	116.82
49	69.82	143.62
50	69.09	134.93


In [30]:
import os

rdd = spark.sparkContext.textFile(os.path.join("data", "ds_spark_heightweights.txt"))

In [29]:
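# Parse each tab-separated line into a list of floats.
myRdd = rdd.map(lambda x: [float(_x) for _x in x.split('\t')])
# Assumption: by the file name, column 2 looks like height and column 3 like weight; names are kept as written to match the outputs below.
myDf = spark.createDataFrame(myRdd, ["id", "weight", "height"])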

In [31]:
myDf.printSchema()

root
 |-- id: double (nullable = true)
 |-- weight: double (nullable = true)
 |-- height: double (nullable = true)



In [33]:
from pyspark.ml.feature import Binarizer

binarizer = Binarizer(threshold=68.0, inputCol="weight", outputCol="weight2")
binDf = binarizer.transform(myDf)

In [34]:
binDf.show()

+----+------+------+-------+
|  id|weight|height|weight2|
+----+------+------+-------+
| 1.0| 65.78|112.99|    0.0|
| 2.0| 71.52|136.49|    1.0|
| 3.0|  69.4|153.03|    1.0|
| 4.0| 68.22|142.34|    1.0|
| 5.0| 67.79| 144.3|    0.0|
| 6.0|  68.7| 123.3|    1.0|
| 7.0|  69.8|141.49|    1.0|
| 8.0| 70.01|136.46|    1.0|
| 9.0|  67.9|112.37|    0.0|
|10.0| 66.78|120.67|    0.0|
|11.0| 66.49|127.45|    0.0|
|12.0| 67.62|114.14|    0.0|
|13.0|  68.3|125.61|    1.0|
|14.0| 67.12|122.46|    0.0|
|15.0| 68.28|116.09|    1.0|
|16.0| 71.09| 140.0|    1.0|
|17.0| 66.46| 129.5|    0.0|
|18.0| 68.65|142.97|    1.0|
|19.0| 71.23| 137.9|    1.0|
|20.0| 67.13|124.04|    0.0|
+----+------+------+-------+
only showing top 20 rows
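
The thresholding that Binarizer performs above can be sketched in plain Python: values strictly greater than the threshold become 1.0 and all others become 0.0. The function name `binarize` is a hypothetical helper, and the sample values are the first five rows of the column binarized above.

```python
def binarize(values, threshold):
    """Map each value to 1.0 if it exceeds the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


# First five values of the binarized column shown above.
sample = [65.78, 71.52, 69.4, 68.22, 67.79]
flags = binarize(sample, threshold=68.0)

print(flags)
```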



In [35]:
from pyspark.ml.feature import QuantileDiscretizer

discretizer = QuantileDiscretizer(numBuckets=3, inputCol="height", outputCol="height3")
qdDf = discretizer.fit(binDf).transform(binDf)

In [36]:
qdDf.show()

+----+------+------+-------+-------+
|  id|weight|height|weight2|height3|
+----+------+------+-------+-------+
| 1.0| 65.78|112.99|    0.0|    0.0|
| 2.0| 71.52|136.49|    1.0|    1.0|
| 3.0|  69.4|153.03|    1.0|    2.0|
| 4.0| 68.22|142.34|    1.0|    2.0|
| 5.0| 67.79| 144.3|    0.0|    2.0|
| 6.0|  68.7| 123.3|    1.0|    0.0|
| 7.0|  69.8|141.49|    1.0|    2.0|
| 8.0| 70.01|136.46|    1.0|    1.0|
| 9.0|  67.9|112.37|    0.0|    0.0|
|10.0| 66.78|120.67|    0.0|    0.0|
|11.0| 66.49|127.45|    0.0|    1.0|
|12.0| 67.62|114.14|    0.0|    0.0|
|13.0|  68.3|125.61|    1.0|    1.0|
|14.0| 67.12|122.46|    0.0|    0.0|
|15.0| 68.28|116.09|    1.0|    0.0|
|16.0| 71.09| 140.0|    1.0|    2.0|
|17.0| 66.46| 129.5|    0.0|    1.0|
|18.0| 68.65|142.97|    1.0|    2.0|
|19.0| 71.23| 137.9|    1.0|    2.0|
|20.0| 67.13|124.04|    0.0|    1.0|
+----+------+------+-------+-------+
only showing top 20 rows
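
The bucketing idea behind QuantileDiscretizer can be sketched in plain Python. This is a simplification under stated assumptions: Spark computes approximate quantiles, whereas this sketch picks exact split points from the sorted data, so bucket boundaries may differ from Spark's on real data. `quantile_splits` and `discretize` are hypothetical helper names.

```python
from bisect import bisect_right


def quantile_splits(values, num_buckets):
    """Pick num_buckets - 1 split points at evenly spaced ranks."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // num_buckets] for k in range(1, num_buckets)]


def discretize(values, splits):
    """Assign each value the index of the bucket it falls into."""
    # bisect_right counts how many split points are <= the value,
    # so a value equal to a split point goes to the upper bucket.
    return [float(bisect_right(splits, v)) for v in values]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
splits = quantile_splits(data, num_buckets=3)
buckets = discretize(data, splits)

print(splits)
print(buckets)
```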



## VectorAssembler

- Combines several columns into a single vector column. It is used to build a feature column.

- String columns, however, cannot be combined.
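
What the two points above amount to can be sketched in plain Python: numeric columns of a row are collected into one flat list, and a string value is rejected. The function `assemble` and the dictionary-based row are hypothetical stand-ins used only for illustration, not Spark's API.

```python
def assemble(row, input_cols):
    """Collect the selected columns of one row into a flat feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, str):
            # Strings must be indexed or encoded before they can be assembled.
            raise TypeError(f"Column {col!r} holds a string and cannot be assembled")
        if isinstance(value, (list, tuple)):
            features.extend(float(x) for x in value)  # flatten vector-like columns
        else:
            features.append(float(value))
    return features


row = {"id": 3.0, "weight2": 1.0, "height3": 2.0, "name": "x"}
features = assemble(row, ["weight2", "height3"])

print(features)
```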

In [37]:
from pyspark.ml.feature import VectorAssembler

va = VectorAssembler(inputCols=["weight2", "height3"], outputCol="features")
vaDf = va.transform(qdDf)

In [38]:
vaDf.printSchema()

root
 |-- id: double (nullable = true)
 |-- weight: double (nullable = true)
 |-- height: double (nullable = true)
 |-- weight2: double (nullable = true)
 |-- height3: double (nullable = true)
 |-- features: vector (nullable = true)



In [39]:
vaDf.show()

+----+------+------+-------+-------+---------+
|  id|weight|height|weight2|height3| features|
+----+------+------+-------+-------+---------+
| 1.0| 65.78|112.99|    0.0|    0.0|(2,[],[])|
| 2.0| 71.52|136.49|    1.0|    1.0|[1.0,1.0]|
| 3.0|  69.4|153.03|    1.0|    2.0|[1.0,2.0]|
| 4.0| 68.22|142.34|    1.0|    2.0|[1.0,2.0]|
| 5.0| 67.79| 144.3|    0.0|    2.0|[0.0,2.0]|
| 6.0|  68.7| 123.3|    1.0|    0.0|[1.0,0.0]|
| 7.0|  69.8|141.49|    1.0|    2.0|[1.0,2.0]|
| 8.0| 70.01|136.46|    1.0|    1.0|[1.0,1.0]|
| 9.0|  67.9|112.37|    0.0|    0.0|(2,[],[])|
|10.0| 66.78|120.67|    0.0|    0.0|(2,[],[])|
|11.0| 66.49|127.45|    0.0|    1.0|[0.0,1.0]|
|12.0| 67.62|114.14|    0.0|    0.0|(2,[],[])|
|13.0|  68.3|125.61|    1.0|    1.0|[1.0,1.0]|
|14.0| 67.12|122.46|    0.0|    0.0|(2,[],[])|
|15.0| 68.28|116.09|    1.0|    0.0|[1.0,0.0]|
|16.0| 71.09| 140.0|    1.0|    2.0|[1.0,2.0]|
|17.0| 66.46| 129.5|    0.0|    1.0|[0.0,1.0]|
|18.0| 68.65|142.97|    1.0|    2.0|[1.0,2.0]|
|19.0| 71.23|

## Pipeline

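- A __Pipeline__ chains several stages (Transformers and Estimators) into a single Estimator; fitting it returns a PipelineModel.

- A Pipeline groups several tasks so that the stages are applied step by step, in order.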
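
The way a pipeline applies its stages in order can be sketched in plain Python. Every class here is hypothetical and greatly simplified; the sketch only illustrates the fit-then-transform chaining described above, not Spark's Pipeline class.

```python
class LowercaseTokenizer:
    """Hypothetical transformer: needs no fitting."""

    def transform(self, rows):
        return [row.lower().split() for row in rows]


class VocabularyModel:
    """Fitted model that maps known tokens to integer ids."""

    def __init__(self, index):
        self.index = index

    def transform(self, rows):
        return [[self.index[t] for t in row if t in self.index] for row in rows]


class VocabularyIndexer:
    """Hypothetical estimator: fit() learns a vocabulary and returns a model."""

    def fit(self, rows):
        vocab = sorted({tok for row in rows for tok in row})
        return VocabularyModel({tok: i for i, tok in enumerate(vocab)})


class SimplePipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class SimplePipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimators are fitted on the data so far
            rows = stage.transform(rows)  # each stage's output feeds the next stage
            fitted.append(stage)
        return SimplePipelineModel(fitted)


model = SimplePipeline([LowercaseTokenizer(), VocabularyIndexer()]).fit(["b a", "a c"])
result = model.transform(["A b", "c d"])
print(result)
```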

In [40]:
df = spark.createDataFrame([
        (0, "a b c d e spark", 1.0),
        (1, "b d", 0.0),
        (2, "spark f g h", 1.0),
        (3, "hadoop mapreduce", 0.0),
        (4, "my dog has flea problems. help please.",0.0)
    ], ["id", "text", "label"])

In [42]:
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)

In [None]:
from pyspark.ml import Pipeline

pipeline = Pipeline(
    # Assumption: lr (defined in the previous cell) is meant to be the final stage.
    stages=[tokenizer, hashingTF, lr]
)