<h1> Low-level RDD API pattern vs. high-level DataFrame API pattern

<h3> Using the low-level RDD API

In [3]:
import findspark
findspark.init()
from pyspark.sql import SparkSession

In [4]:
# Example using an RDD
spark = SparkSession.builder.appName("DataFrame").getOrCreate()
sc = spark.sparkContext
# Create an RDD of (name, age) tuples.
dataRDD = sc.parallelize([("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25)])

# Use map and reduceByKey transformations with lambda expressions to aggregate and then average the ages per name
ageRDD = (dataRDD
          .map(lambda x: (x[0], (x[1], 1)))
          .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
          .map(lambda x: (x[0], x[1][0] / x[1][1]))
          )
print(ageRDD.collect())
spark.stop()

[('Denny', 31.0), ('TD', 35.0), ('Brooke', 22.5), ('Jules', 30.0)]
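
The (sum, count) pairing used in the RDD pipeline above can be traced without a cluster. The following is a minimal pure-Python sketch, not Spark code: the helper name `average_by_key` is hypothetical, and a plain dictionary stands in for the grouping that `reduceByKey` performs across partitions.

```python
def average_by_key(pairs):
    """Mimic the map -> reduceByKey -> map steps of the RDD example in plain Python."""
    totals = {}
    for name, age in pairs:
        # map step: turn (name, age) into (name, (age, 1))
        value = (age, 1)
        if name in totals:
            # reduceByKey step: add the running sum and the running count element-wise
            prev = totals[name]
            totals[name] = (prev[0] + value[0], prev[1] + value[1])
        else:
            totals[name] = value
    # final map step: divide the summed ages by the number of records
    return {name: total / count for name, (total, count) in totals.items()}


data = [("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25)]
averages = average_by_key(data)
print(averages)
```

Carrying the count alongside the sum is what lets the reduction stay associative: averaging partial averages directly would give the wrong answer whenever the groups differ in size.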


<h3> Using the high-level DSL and the DataFrame API

In [5]:
import findspark
findspark.init()

In [6]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
spark = SparkSession.builder.appName("DataFrame").getOrCreate()

data_df = spark.createDataFrame([("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25)], ['name', 'age'])
avg_df = data_df.groupBy('name').agg(avg('age'))
avg_df.show()
input("Press Enter to terminate...")
spark.stop()

+------+--------+
|  name|avg(age)|
+------+--------+
|Brooke|    22.5|
| Denny|    31.0|
| Jules|    30.0|
|    TD|    35.0|
+------+--------+



<h1> Ways to define a schema

<h3> Programmatic style

In [7]:
from pyspark.sql.types import *
schema = StructType([StructField('author', StringType(), False),
                     StructField('title',StringType(), False),
                     StructField('pages',IntegerType(),False)])

In [8]:
schema

StructType([StructField('author', StringType(), False), StructField('title', StringType(), False), StructField('pages', IntegerType(), False)])

<h3> Using DDL

In [9]:
schema = 'author STRING, title STRING, pages INT'

In [10]:
schema

'author STRING, title STRING, pages INT'

<h1> Rows


In [11]:
import findspark
findspark.init()

from pyspark.sql import Row
from pyspark.sql import SparkSession

blog_row = Row(6, "Reynold", "Xin", "https://tinyurl.6", 255568, "3/2/2015", ["twitter","LinkedIn"])
blog_row[1]

'Reynold'

<h3> Converting rows to a DataFrame for quick exploration

In [12]:
spark = SparkSession \
        .builder \
        .appName('DataFrame') \
        .getOrCreate()
rows = [Row("Matei Zaharia","CA"),Row("Reynold Xin", "CA")]
authors_df = spark.createDataFrame(rows,["Authors","State"])
authors_df.show()

+-------------+-----+
|      Authors|State|
+-------------+-----+
|Matei Zaharia|   CA|
|  Reynold Xin|   CA|
+-------------+-----+



<h1> San Francisco example

In [60]:
from pyspark.sql.types import *
from pyspark.sql import SparkSession

spark = SparkSession \
        .builder \
        .appName('San Francisco') \
        .getOrCreate()

# Define the schema programmatically
fire_schema = StructType([StructField('CallNumber', IntegerType(),True),
                           StructField('UnitID', StringType(), True),
                           StructField('IncidentNumber',IntegerType(),True),
                           StructField('CallType',StringType(),True),
                           StructField('CallDate',StringType(),True),
                           StructField('WatchDate',StringType(),True),
                           StructField('CallFinalDisposition',StringType(),True),
                           StructField('AvailableDtTm',StringType(),True),
                           StructField('Address',StringType(),True),
                           StructField('City',StringType(),True),
                           StructField('Zipcode',IntegerType(),True),
                           StructField('Battalion',StringType(),True),
                           StructField('StationArea',StringType(),True),
                           StructField('Box',StringType(),True),
                           StructField('OriginalPriority',StringType(),True),
                           StructField('Priority',StringType(),True),
                           StructField('FinalPriority',IntegerType(),True),
                           StructField('ALSUnit',BooleanType(),True),
                           StructField('CallTypeGroup',StringType(),True),
                           StructField('NumAlarms',IntegerType(),True),
                           StructField('UnitType',StringType(),True),
                           StructField('UnitSequenceInCallDispatch',IntegerType(),True),
                           StructField('FirePreventionDistrict',StringType(),True),
                           StructField('SupervisorDistrict',StringType(),True),
                           StructField('Neighborhood',StringType(),True),
                           StructField('Location',StringType(),True),
                           StructField('RowID',StringType(),True),
                           StructField('Delay',FloatType(),True)
                           ])

# Read the CSV file through the DataFrameReader interface.
sf_fire_file = "sf-fire-calls.csv"
fire_df = spark.read.csv(sf_fire_file, header=True, schema=fire_schema)
fire_df.show()

+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+--------------------+----+-------+---------+-----------+----+----------------+--------+-------------+-------+-------------+---------+--------------+--------------------------+----------------------+------------------+--------------------+--------------------+-------------+---------+
|CallNumber|UnitID|IncidentNumber|        CallType|  CallDate| WatchDate|CallFinalDisposition|       AvailableDtTm|             Address|City|Zipcode|Battalion|StationArea| Box|OriginalPriority|Priority|FinalPriority|ALSUnit|CallTypeGroup|NumAlarms|      UnitType|UnitSequenceInCallDispatch|FirePreventionDistrict|SupervisorDistrict|        Neighborhood|            Location|        RowID|    Delay|
+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+--------------------+----+-------+---------+-----------+----+----------------+--------+

<h3> Transformations and actions

<h5> Projections and filters

In [61]:
from pyspark.sql.functions import *

few_fire_df = (fire_df
               .select('IncidentNumber','AvailableDtTm','CallType')
               .where(col('CallType')!='Medical Incident'))
few_fire_df.show(5, truncate=False)  # truncate=False prints long strings in full instead of shortening them

+--------------+----------------------+--------------+
|IncidentNumber|AvailableDtTm         |CallType      |
+--------------+----------------------+--------------+
|2003235       |01/11/2002 01:51:44 AM|Structure Fire|
|2003250       |01/11/2002 04:16:46 AM|Vehicle Fire  |
|2003259       |01/11/2002 06:01:58 AM|Alarms        |
|2003279       |01/11/2002 08:03:26 AM|Structure Fire|
|2003301       |01/11/2002 09:46:44 AM|Alarms        |
+--------------+----------------------+--------------+
only showing top 5 rows



<h5> Number of distinct CallType values recorded in the fire calls

In [62]:
from pyspark.sql.functions import *
(fire_df
 .select('CallType')
 .where(col('CallType').isNotNull())
 .agg(countDistinct('CallType').alias('DistinctCallTypes'))
 .show())

+-----------------+
|DistinctCallTypes|
+-----------------+
|               30|
+-----------------+



<h5> List of call types

In [63]:
from pyspark.sql.functions import *
(fire_df
 .select('CallType')
 .where(col('CallType').isNotNull())
 .distinct()
 .show(10))  # long values are truncated by default

+--------------------+
|            CallType|
+--------------------+
|Elevator / Escala...|
|  Aircraft Emergency|
|              Alarms|
|Odor (Strange / U...|
|Citizen Assist / ...|
|              HazMat|
|           Explosion|
|           Oil Spill|
|        Vehicle Fire|
|  Suspicious Package|
+--------------------+
only showing top 10 rows



<h5> Renaming, adding, and dropping columns

In [67]:
new_fire_df = fire_df.withColumnRenamed('Delay','ResponseDelayedinMins')
(new_fire_df
 .select('ResponseDelayedinMins')
 .where(col('ResponseDelayedinMins')>5)
 .show(5,False))

+---------------------+
|ResponseDelayedinMins|
+---------------------+
|5.35                 |
|6.25                 |
|5.2                  |
|5.6                  |
|7.25                 |
+---------------------+
only showing top 5 rows



<h5> Converting string columns to timestamps

In [75]:
fire_ts_df = (new_fire_df
              .withColumn('IncidentDate', to_timestamp(col("CallDate"), "MM/dd/yyyy"))
              .drop('CallDate')
              .withColumn('OnWatchDate', to_timestamp(col('WatchDate'),'MM/dd/yyyy'))
              .drop('WatchDate')
              .withColumn('AvailableDtTs', to_timestamp(col('AvailableDtTm'), 'MM/dd/yyyy hh:mm:ss a'))
              .drop('AvailableDtTm'))


(fire_ts_df
 .select('IncidentDate','OnWatchDate','AvailableDtTs')
 .show(5,False))

+-------------------+-------------------+-------------------+
|IncidentDate       |OnWatchDate        |AvailableDtTs      |
+-------------------+-------------------+-------------------+
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 01:51:44|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 03:01:18|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 02:39:50|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 04:16:46|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 06:01:58|
+-------------------+-------------------+-------------------+
only showing top 5 rows
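
The format strings passed to `to_timestamp` use Spark's datetime pattern letters (`MM`, `dd`, `yyyy`, `hh`, `a`). As a rough, Spark-free analogy, the sketch below parses one of the `AvailableDtTm` strings shown earlier with Python's standard `datetime.strptime`. The mapping to `%` directives is only an illustration and does not cover every Spark pattern letter.

```python
from datetime import datetime

raw = "01/11/2002 01:51:44 AM"   # sample AvailableDtTm value from the earlier output

# Spark pattern 'MM/dd/yyyy hh:mm:ss a' roughly corresponds to this strptime format:
# MM -> %m, dd -> %d, yyyy -> %Y, hh (12-hour clock) -> %I, mm -> %M, ss -> %S, a (AM/PM) -> %p
parsed = datetime.strptime(raw, "%m/%d/%Y %I:%M:%S %p")
print(parsed)

# A PM value shows why the 12-hour 'hh' has to be paired with the 'a' marker.
afternoon = datetime.strptime("01/11/2002 01:51:44 PM", "%m/%d/%Y %I:%M:%S %p")
print(afternoon.hour)
```

Without the AM/PM marker a 12-hour value is ambiguous, which is why the pattern for `AvailableDtTm` includes `a` while the date-only columns need just `MM/dd/yyyy`.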

