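# Problem: computing zscore and cdf
Although the sample size n is small, assume the marks are normally distributed.

- 1. Create a DataFrame from the marks data.

- 2. Create a zscore column. Computing a zscore requires the mean and the standard deviation. The aggregate functions in `pyspark.sql.functions` (imported as `F`) cannot be called inside a udf, so using them directly in the udf formula raises an error. Compute the mean and standard deviation separately first and use those values in the formula.

- 3. Create a cdf column. The value returned by scipy.stats.norm.cdf() has to be converted to float. By default, cdf computes the cumulative probability for mean=0 and standard deviation=1.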
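
For reference, the two quantities in the steps above can be written as follows. This is a sketch that assumes the sample standard deviation (which is what Spark's `F.stddev` computes) and the standard normal distribution that `norm.cdf()` uses by default. The symbols $x_i$, $\bar{x}$, $s$, $n$ and $\Phi$ are notation introduced here for illustration.

$$
z_i = \frac{x_i - \bar{x}}{s}, \qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}, \qquad
\Phi(z) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right]
$$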

In [1]:
import pyspark

spark = pyspark.sql.SparkSession\
    .builder\
    .master("local")\
    .appName("zscore_cdf")\
    .config(conf=pyspark.SparkConf())\
    .getOrCreate()

21/11/01 10:38:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/11/01 10:38:01 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [2]:
marks=[
    "김하나, English, 100",
    "김하나, Math, 80",
    "임하나, English, 70",
    "임하나, Math, 100",
    "김갑돌, English, 82.3",
    "김갑돌, Math, 98.5"
]

In [3]:
from pyspark.sql.types import FloatType

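# Note: split() with no argument splits on whitespace only, so the name and
# subject fields keep their trailing commas (visible in the output below).
_marksRdd = spark.sparkContext.parallelize(marks).map(lambda x: x.split())
_marksDf = spark.createDataFrame(_marksRdd, schema=["name", "subject", "mark"])
# Cast the string marks to 32-bit floats so they can be used in calculations.
_marksDf = _marksDf.withColumn("markF", _marksDf["mark"].cast(FloatType()))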
_marksDf.show()

                                                                                

+-------+--------+----+-----+
|   name| subject|mark|markF|
+-------+--------+----+-----+
|김하나,|English,| 100|100.0|
|김하나,|   Math,|  80| 80.0|
|임하나,|English,|  70| 70.0|
|임하나,|   Math,| 100|100.0|
|김갑돌,|English,|82.3| 82.3|
|김갑돌,|   Math,|98.5| 98.5|
+-------+--------+----+-----+
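
The trailing commas in the `name` and `subject` columns come from splitting each line on whitespace. As a minimal sketch in plain Python (no Spark needed), the following compares that split with one possible alternative that splits on the comma and strips the surrounding spaces. The variable names `rows_whitespace` and `rows_clean` are made up for this illustration.

```python
marks = [
    "김하나, English, 100",
    "김하나, Math, 80",
    "임하나, English, 70",
    "임하나, Math, 100",
    "김갑돌, English, 82.3",
    "김갑돌, Math, 98.5",
]

# Splitting on whitespace keeps the comma attached to the first two fields.
rows_whitespace = [line.split() for line in marks]

# Splitting on "," and stripping each field removes both commas and spaces.
rows_clean = [[field.strip() for field in line.split(",")] for line in marks]

print(rows_whitespace[0])
print(rows_clean[0])
```

Passing the second lambda, `lambda x: [f.strip() for f in x.split(",")]`, to `map()` would give columns without commas; the rest of this notebook keeps the original parsing, so its outputs still show them.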



In [4]:
_marksDf.printSchema()

root
 |-- name: string (nullable = true)
 |-- subject: string (nullable = true)
 |-- mark: string (nullable = true)
 |-- markF: float (nullable = true)



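### Computing the zscore

`F` functions cannot be used inside a udf: they build Column expressions rather than returning plain numbers. The mean and std therefore have to be computed beforehand and their values used in the udf.

They can be obtained as follows.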

In [5]:
from pyspark.sql import functions as F

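# Aggregate once and bring the two scalars back to the driver.
# F.stddev is the sample standard deviation (alias of stddev_samp).
_markStat = _marksDf.select(
    F.mean("markF").alias("mean"),
    F.stddev("markF").alias("std")
).collect()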


In [10]:
_markStat[0]["mean"]

88.46666717529297
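
The mean is not exactly 88.4666... because `markF` is a 32-bit `FloatType` column, and 82.3 has no exact 32-bit representation. The following plain-Python sketch reproduces the statistics under that assumption; `to_float32` is a helper defined here for illustration and is not part of Spark.

```python
import statistics
import struct


def to_float32(value: float) -> float:
    """Round a Python float to the nearest 32-bit float, returned as a float."""
    return struct.unpack("f", struct.pack("f", value))[0]


raw_marks = [100.0, 80.0, 70.0, 100.0, 82.3, 98.5]
marks_f32 = [to_float32(m) for m in raw_marks]

mean_f32 = statistics.mean(marks_f32)
std_f32 = statistics.stdev(marks_f32)  # sample standard deviation (n - 1)

print(to_float32(82.3))  # slightly above 82.3
print(mean_f32, std_f32)
```

Using `statistics.stdev` (divisor n - 1) rather than `statistics.pstdev` matches the sample standard deviation that `F.stddev` returns.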

In [11]:
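# Plain Python numbers captured by the udf; no F functions are called inside it.
mean, std = _markStat[0]["mean"], _markStat[0]["std"]
calcZscore = F.udf(lambda x: (x - mean) / std, FloatType())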

_marksDf = _marksDf.withColumn("zscore", calcZscore(F.col("markF")))
_marksDf.show()

+-------+--------+----+-----+-----------+
|   name| subject|mark|markF|     zscore|
+-------+--------+----+-----+-----------+
|김하나,|English,| 100|100.0|  0.9020148|
|김하나,|   Math,|  80| 80.0| -0.6621728|
|임하나,|English,|  70| 70.0| -1.4442666|
|임하나,|   Math,| 100|100.0|  0.9020148|
|김갑돌,|English,|82.3| 82.3|-0.48229098|
|김갑돌,|   Math,|98.5| 98.5| 0.78470075|
+-------+--------+----+-----+-----------+
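
As a sanity check on the zscore column, standardized values should have mean 0 and sample standard deviation 1. The sketch below recomputes the zscores in plain Python; it uses double-precision marks rather than Spark's 32-bit floats, so the digits agree with the table only to about six decimal places.

```python
import statistics

marks = [100.0, 80.0, 70.0, 100.0, 82.3, 98.5]

mean = statistics.mean(marks)
std = statistics.stdev(marks)  # sample standard deviation, as in F.stddev

# Same formula as the udf: (x - mean) / std
zscores = [(m - mean) / std for m in marks]

for m, z in zip(marks, zscores):
    print(f"{m:6.1f} {z: .7f}")

print(statistics.mean(zscores), statistics.stdev(zscores))
```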



### Computing the cdf

- norm.cdf() returns numpy.float64, which is a NumPy type and not a data type that Spark uses.

- Convert the result with float() before returning it from the udf.

In [12]:
from scipy.stats import norm

type(norm.cdf(1))

numpy.float64

In [13]:
normCdf = F.udf(lambda x: float(norm.cdf(x)), FloatType())

In [16]:
_marksDf = _marksDf.withColumn("cdf", normCdf(F.col("zscore")))
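
The cdf values can be cross-checked without scipy. The sketch below swaps in the standard library's `math.erf` for `norm.cdf()`, assuming the default mean 0 and standard deviation 1; `std_normal_cdf` is a name chosen for this illustration.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# zscore values taken from the table above
for z in [0.9020148, -0.6621728, -1.4442666]:
    print(z, std_normal_cdf(z))
```

Because `math.erf` already returns a Python `float`, no conversion is needed here, unlike with the `numpy.float64` returned by `norm.cdf()`.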

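### Computing the zscore with Window functions
- To add the overall mean mark as a column on every row, use the Window feature.

#### Whole-table window
- "Mean mark" on its own is ambiguous to Spark. You have to specify whether it means the mean over all rows, the mean per person, or the mean per subject.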
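
To make the distinction concrete, here is a plain-Python sketch of what the two kinds of window compute: an aggregate over all rows versus an aggregate per subject, each attached back to every row. It only imitates the semantics and is not how Spark executes a Window; the names `overall_mean`, `subject_mean` and `annotated` are made up for this example.

```python
from collections import defaultdict
from statistics import mean

rows = [
    ("김하나", "English", 100.0),
    ("김하나", "Math", 80.0),
    ("임하나", "English", 70.0),
    ("임하나", "Math", 100.0),
    ("김갑돌", "English", 82.3),
    ("김갑돌", "Math", 98.5),
]

# Whole-table window: one mean shared by every row.
overall_mean = mean(mark for _, _, mark in rows)

# Window partitioned by subject: one mean per subject.
marks_by_subject = defaultdict(list)
for _, subject, mark in rows:
    marks_by_subject[subject].append(mark)
subject_mean = {s: mean(v) for s, v in marks_by_subject.items()}

# Unlike a group-by, a window keeps every input row and adds the aggregate.
annotated = [
    (name, subject, mark, overall_mean, subject_mean[subject])
    for name, subject, mark in rows
]

for row in annotated:
    print(row)
```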

In [17]:
from pyspark.sql.window import Window

# Frame spanning every row, from the first to the last, with no partitioning.
byAll = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

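**Computing with overall mean and standard deviation columns**
- Compute the mean and standard deviation over the whole-table Window, and the mean per subject.

- Here, the mean and standard deviation are first added as columns, and the zscore is then computed from them.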

In [20]:
from pyspark.sql import functions as F

_marksDf = _marksDf.withColumn("mean", F.avg(F.col("markF")).over(byAll))
_marksDf = _marksDf.withColumn("stddev", F.stddev(F.col("markF")).over(byAll))
_marksDf.show()


21/11/01 11:17:30 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
21/11/01 11:17:30 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


+-------+--------+----+-----+-----------+----------+-----------------+------------------+
|   name| subject|mark|markF|     zscore|       cdf|             mean|            stddev|
+-------+--------+----+-----+-----------+----------+-----------------+------------------+
|김하나,|English,| 100|100.0|  0.9020148| 0.8164755|88.46666717529297|12.786190172956093|
|김하나,|   Math,|  80| 80.0| -0.6621728|0.25393024|88.46666717529297|12.786190172956093|
|임하나,|English,|  70| 70.0| -1.4442666|  0.074332|88.46666717529297|12.786190172956093|
|임하나,|   Math,| 100|100.0|  0.9020148| 0.8164755|88.46666717529297|12.786190172956093|
|김갑돌,|English,|82.3| 82.3|-0.48229098|0.31479964|88.46666717529297|12.786190172956093|
|김갑돌,|   Math,|98.5| 98.5| 0.78470075|0.78368545|88.46666717529297|12.786190172956093|
+-------+--------+----+-----+-----------+----------+-----------------+------------------+



In [21]:
bySubject = Window.partitionBy("subject")

In [23]:
_marksDf = _marksDf.withColumn("meanBySubject", F.avg(F.col("markF")).over(bySubject))
_marksDf.show()

21/11/01 11:20:05 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
21/11/01 11:20:05 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


+-------+--------+----+-----+-----------+----------+-----------------+------------------+-----------------+
|   name| subject|mark|markF|     zscore|       cdf|             mean|            stddev|    meanBySubject|
+-------+--------+----+-----+-----------+----------+-----------------+------------------+-----------------+
|김하나,|English,| 100|100.0|  0.9020148| 0.8164755|88.46666717529297|12.786190172956093|84.10000101725261|
|임하나,|English,|  70| 70.0| -1.4442666|  0.074332|88.46666717529297|12.786190172956093|84.10000101725261|
|김갑돌,|English,|82.3| 82.3|-0.48229098|0.31479964|88.46666717529297|12.786190172956093|84.10000101725261|
|김하나,|   Math,|  80| 80.0| -0.6621728|0.25393024|88.46666717529297|12.786190172956093|92.83333333333333|
|임하나,|   Math,| 100|100.0|  0.9020148| 0.8164755|88.46666717529297|12.786190172956093|92.83333333333333|
|김갑돌,|   Math,|98.5| 98.5| 0.78470075|0.78368545|88.46666717529297|12.786190172956093|92.83333333333333|
+-------+--------+----+-----+-----------+----------+-----------------+------------------+-----------------+

In [24]:
_marksDf = _marksDf.withColumn("zscore1", (F.col("markF") - F.col("mean")) / F.col("stddev"))
_marksDf.select("zscore", "zscore1").show()

21/11/01 11:22:44 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
21/11/01 11:22:44 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


+-----------+-------------------+
|     zscore|            zscore1|
+-----------+-------------------+
|  0.9020148|  0.902014804151829|
| -0.6621728| -0.662172786480269|
| -1.4442666| -1.444266581796318|
|  0.9020148|  0.902014804151829|
|-0.48229098|-0.4822909748814927|
| 0.78470075| 0.7847007348544217|
+-----------+-------------------+



**Computing without overall mean and standard deviation columns**
- Alternatively, the zscore can be computed by using the Window functions directly in the expression.

In [25]:
_marksDf = _marksDf\
    .withColumn("zscore2", (F.col("markF") - F.avg("markF").over(byAll)) / F.stddev("markF").over(byAll))

In [26]:
_marksDf.select("zscore", "zscore1", "zscore2").show()

21/11/01 11:26:07 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
21/11/01 11:26:07 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
21/11/01 11:26:07 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


+-----------+-------------------+-------------------+
|     zscore|            zscore1|            zscore2|
+-----------+-------------------+-------------------+
|  0.9020148|  0.902014804151829|  0.902014804151829|
| -0.6621728| -0.662172786480269| -0.662172786480269|
| -1.4442666| -1.444266581796318| -1.444266581796318|
|  0.9020148|  0.902014804151829|  0.902014804151829|
|-0.48229098|-0.4822909748814927|-0.4822909748814927|
| 0.78470075| 0.7847007348544217| 0.7847007348544217|
+-----------+-------------------+-------------------+

