RDD API

https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds

- Creating an RDD
- map
- filter
- flatMap
- groupByKey
- groupBy
- reduceByKey
- sortByKey
- distinct
- join
- leftOuterJoin
- rightOuterJoin
- fullOuterJoin
- count
- countByKey
- foreach
- randomSplit
- union
- intersection
- subtract
- cartesian

In [1]:
sc

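# Creating an RDD
An RDD is a schema-less data structure, unlike a DataFrame. There are two common ways to create one:
1. Call .parallelize on an in-memory collection such as a list or array.
2. Read an external file with .textFile.

- Creating an RDD from a collection in the program (mainly useful for testing)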

In [2]:
rdd1 = sc.parallelize(
    [('Ferrari', 'fast'), {'Porsche', 10000}, ['Spain', 'visited', 4504]], 4)
rdd1.collect()

[('Ferrari', 'fast'), {10000, 'Porsche'}, ['Spain', 'visited', 4504]]

In [3]:
rdd1.collect()[0]

('Ferrari', 'fast')

In [4]:
rdd1.collect()[1]

{10000, 'Porsche'}

In [5]:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
distData.collect()

[1, 2, 3, 4, 5]

- Creating an RDD from an external file

In [8]:
rdd2 = sc.textFile('./data/VS14MORT.txt.gz', 4)

In [9]:
rdd2.take(1)

['                   1                                          2101  M1087 432311  4M4                2014U7CN                                    I64 238 070   24 0111I64                                                                                                                                                                           01 I64                                                                                                  01  11                                 100 601']

## map
Applies a function to every element of the dataset and returns the results as a new RDD.

In [11]:
rdd1 = sc.parallelize(["b", "a", "c"])
rdd2 = rdd1.map(lambda x: (x, 1))
sorted(rdd2.collect())

[('a', 1), ('b', 1), ('c', 1)]

## filter
Returns a new RDD containing only the elements for which the function returns True.

In [12]:
rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.filter(lambda x: x % 2 == 0).collect()

[2, 4]

## flatMap
Applies the function to every element first, then flattens the per-element results into a single sequence.

In [13]:
r1 = sc.parallelize(["hello zeropython", "hello 168seo.cn"])
r2 = r1.flatMap(lambda x: x.split(" "))
r3 = r1.map(lambda x: x.split(" "))

print(r2.collect())
print(r3.collect())

['hello', 'zeropython', 'hello', '168seo.cn']
[['hello', 'zeropython'], ['hello', '168seo.cn']]
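
The difference between map and flatMap in the output above can be reproduced without Spark. The following is a minimal plain-Python sketch of the semantics only, not of how Spark executes them; the variable names are illustrative.

```python
from itertools import chain

lines = ["hello zeropython", "hello 168seo.cn"]

# map: one output element per input element, so each line becomes a list.
mapped = [line.split(" ") for line in lines]

# flatMap: apply the same function, then concatenate the per-element results.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

The nested list corresponds to r3 (map) and the flat list to r2 (flatMap).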


## groupByKey
Groups together the values that share the same key.

In [15]:
from operator import add

data = ["hello zeropython", "hello 168seo.cn"]
r1 = sc.parallelize(data)
r2 = r1.flatMap(lambda x: x.split(" ")).map(lambda y: (y, 1))
print("r2", r2.collect())
r3 = r2.groupByKey()
print("r3", r3.collect())
r4 = r3.map(lambda x: {x[0]: list(x[1])})
print("r4", r4.collect())
print(r2.reduceByKey(add).collect())

r2 [('hello', 1), ('zeropython', 1), ('hello', 1), ('168seo.cn', 1)]
r3 [('hello', <pyspark.resultiterable.ResultIterable object at 0x7fb599b53790>), ('168seo.cn', <pyspark.resultiterable.ResultIterable object at 0x7fb599b607c0>), ('zeropython', <pyspark.resultiterable.ResultIterable object at 0x7fb599b60820>)]
r4 [{'hello': [1, 1]}, {'168seo.cn': [1]}, {'zeropython': [1]}]
[('hello', 2), ('168seo.cn', 1), ('zeropython', 1)]
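
To make the relationship between groupByKey and reduceByKey concrete, here is a rough single-machine model in plain Python. The helpers group_by_key and reduce_by_key are hypothetical names invented for this sketch; they mimic the results only and ignore partitioning and shuffling.

```python
from collections import defaultdict
from operator import add


def group_by_key(pairs):
    """Collect every value under its key (hypothetical helper)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(pairs, func):
    """Fold the values of each key with func (hypothetical helper)."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


pairs = [('hello', 1), ('zeropython', 1), ('hello', 1), ('168seo.cn', 1)]
grouped = group_by_key(pairs)
reduced = reduce_by_key(pairs, add)
print(grouped)
print(reduced)
```

In Spark itself, reduceByKey combines values inside each partition before the shuffle, which is why it is usually preferred over groupByKey followed by a sum when only the aggregate is needed.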


## groupBy

In [16]:
data = [1, 2, 3, 5]
intRDD = sc.parallelize(data)
result = intRDD.groupBy(lambda x: x % 2).collect()
sorted([(x, sorted(y)) for (x, y) in result])

[(0, [2]), (1, [1, 3, 5])]

## reduceByKey
Brings together the values that share the same key and combines them with the given function.

In [17]:
data_key = sc.parallelize([('a', 4), ('b', 3), ('c', 2), ('a', 8), ('d', 2),
                           ('b', 1), ('d', 3)], 4)
data_key.reduceByKey(lambda x, y: x + y).collect()

[('b', 4), ('c', 2), ('a', 12), ('d', 5)]

In [18]:
data = ["hello zeropython", "hello 168seo.cn"]

# print(list(data))
r1 = sc.parallelize(data)

r2 = r1.flatMap(lambda x: x.split(" ")).map(lambda x: (x, 1))

print("r2", r2.collect())
r3 = r2.reduceByKey(lambda x, y: x + y)

print("r3", r3.collect())

r2 [('hello', 1), ('zeropython', 1), ('hello', 1), ('168seo.cn', 1)]
r3 [('hello', 2), ('168seo.cn', 1), ('zeropython', 1)]


## sortByKey

In [20]:
sc.setLogLevel("ERROR")
data = [
    "hello zeropython", "hwlldsf world", "168seo.cn", "168seo.cn",
    "hello 168seo.cn"
]

# print(list(data))
r1 = sc.parallelize(data)

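r2 = r1.flatMap(lambda x: x.split(" "))\
    .map(lambda y: (y, 1))\
    .reduceByKey(lambda x, y: x + y)\
    .sortByKey()
# sortByKey sorts by key (the word), ascending by default; its first argument
# is the ascending flag, not a key function. To sort by count instead, use
# sortBy(lambda kv: kv[1]).
# reduceByKey turns [("a", 1), ("a", 1), ("a", 1)] into [("a", 3)].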
print(r2.collect())

[('168seo.cn', 3), ('hello', 2), ('hwlldsf', 1), ('world', 1), ('zeropython', 1)]
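
The output above is ordered by word, not by count. As a small plain-Python illustration of the two orderings (the counts list is copied from the result above in shuffled order; nothing here calls Spark):

```python
counts = [('hello', 2), ('zeropython', 1), ('168seo.cn', 3),
          ('world', 1), ('hwlldsf', 1)]

# Ordering by key, which is what sortByKey produces by default.
by_key = sorted(counts, key=lambda kv: kv[0])

# Ordering by count, largest first. In PySpark this would correspond to
# rdd.sortBy(lambda kv: kv[1], ascending=False).
by_count_desc = sorted(counts, key=lambda kv: kv[1], reverse=True)

print(by_key)
print(by_count_desc)
```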


# Set operations

## distinct

In [21]:
rdd = sc.parallelize([1, 1, 2, 3])
sorted(rdd.distinct().collect())

[1, 2, 3]

## join

In [24]:
rd1 = sc.parallelize([('a', 1), ('b', 4), ('c', 10)])
rd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])
rd3 = rd1.join(rd2)
rd3.collect()

[('b', (4, '6')), ('a', (1, 4)), ('a', (1, 1))]

## leftOuterJoin

In [25]:
rd1 = sc.parallelize([('a', 1), ('b', 4), ('c', 10)])
rd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])
rd3 = rd1.leftOuterJoin(rd2)
rd3.collect()

[('b', (4, '6')), ('c', (10, None)), ('a', (1, 4)), ('a', (1, 1))]

## rightOuterJoin

In [26]:
rd1 = sc.parallelize([('a', 1), ('b', 4), ('c', 10)])
rd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])
rd3 = rd1.rightOuterJoin(rd2)
rd3.collect()

[('b', (4, '6')), ('a', (1, 4)), ('a', (1, 1)), ('d', (None, 15))]

## fullOuterJoin

In [27]:
rd1 = sc.parallelize([('a', 1), ('b', 4), ('c', 10)])
rd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])
rd3 = rd1.fullOuterJoin(rd2)
rd3.collect()

[('b', (4, '6')),
 ('c', (10, None)),
 ('a', (1, 4)),
 ('a', (1, 1)),
 ('d', (None, 15))]
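
The four join variants differ only in which keys they keep and where None is filled in. The sketch below models that in plain Python; join_pairs and its how argument are hypothetical names made up for this illustration, and the element order of the result is not meant to match Spark's.

```python
from collections import defaultdict


def join_pairs(left, right, how="inner"):
    """Plain-Python model of key-based joins on (key, value) pairs."""
    lmap, rmap = defaultdict(list), defaultdict(list)
    for key, value in left:
        lmap[key].append(value)
    for key, value in right:
        rmap[key].append(value)

    if how == "inner":
        keys = lmap.keys() & rmap.keys()   # keys present on both sides
    elif how == "left":
        keys = lmap.keys()                 # every key of the left side
    elif how == "right":
        keys = rmap.keys()                 # every key of the right side
    elif how == "full":
        keys = lmap.keys() | rmap.keys()   # keys present on either side
    else:
        raise ValueError(f"unknown join type: {how}")

    out = []
    for key in keys:
        # A side with no match for this key contributes a single None.
        for lv in lmap.get(key, [None]):
            for rv in rmap.get(key, [None]):
                out.append((key, (lv, rv)))
    return out


rd1 = [('a', 1), ('b', 4), ('c', 10)]
rd2 = [('a', 4), ('a', 1), ('b', '6'), ('d', 15)]
for how in ("inner", "left", "right", "full"):
    print(how, join_pairs(rd1, rd2, how))
```

Here "inner", "left", "right" and "full" stand in for join, leftOuterJoin, rightOuterJoin and fullOuterJoin respectively.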

## count

In [28]:
data_key.count()

7

## countByKey

In [29]:
data_key.countByKey()

defaultdict(int, {'a': 2, 'b': 2, 'c': 1, 'd': 2})

In [30]:
data_key.countByKey().items()

dict_items([('a', 2), ('b', 2), ('c', 1), ('d', 2)])

## foreach

In [31]:
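def f(x):
    print(x)


# foreach runs f on the executors and returns nothing; the printed lines go
# to the executors' stdout, so they may not appear in the notebook output.
data_key.foreach(f)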

## randomSplit

In [32]:
intRDD = sc.parallelize([3, 1, 2, 5, 5])
stringRDD = sc.parallelize(['Apple', 'Orange', 'Grape', 'Banana', 'Apple'])
sRDD = intRDD.randomSplit([0.4, 0.6])
print(len(sRDD))
print(sRDD[0].collect())
print(sRDD[1].collect())

2
[5, 5]
[3, 1, 2]
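
Note that the split above is 2 and 3 elements rather than an exact 40/60 division: the weights act as per-element probabilities, and they are normalized if they do not sum to 1. A rough plain-Python model of that behavior follows. The random_split helper is a hypothetical name for this sketch and does not reproduce Spark's actual sampler or its random numbers.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def random_split(items, weights, seed=None):
    """Assign each item to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split's share of [0, 1).
    bounds = list(accumulate(w / total for w in weights))
    splits = [[] for _ in weights]
    for item in items:
        # min() guards against float rounding in the last bound.
        idx = min(bisect_right(bounds, rng.random()), len(weights) - 1)
        splits[idx].append(item)
    return splits


parts = random_split([3, 1, 2, 5, 5], [0.4, 0.6], seed=42)
print(len(parts))
print(parts)
```

With PySpark, passing a seed as the second argument of randomSplit makes the split reproducible in the same way.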


## union (set union, duplicates kept)

In [26]:
rdd = sc.parallelize([1, 1, 2, 3])
rdd.union(rdd).collect()

[1, 1, 2, 3, 1, 1, 2, 3]

## intersection (set intersection)

In [6]:
rd1 = sc.parallelize([('a', 1), ('b', 4), ('c', 10)])
rd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])
rd3 = rd1.intersection(rd2)
rd3.collect()

[('a', 1)]

## subtract (set difference)

In [31]:
rd1 = sc.parallelize([('a', 1), ('b', 4), ('c', 10)])
rd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])
rd3 = rd1.subtract(rd2)
rd3.collect()

[('b', 4), ('c', 10)]
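
The three set-style operations can be summarized with a plain-Python sketch. The function names below mirror the RDD methods but are ordinary list helpers written for illustration. They assume hashable elements and model only the results, not the order in which Spark returns them.

```python
def union(a, b):
    """Concatenate both inputs; duplicates are kept."""
    return list(a) + list(b)


def intersection(a, b):
    """Elements present in both inputs, with duplicates removed."""
    return list(set(a) & set(b))


def subtract(a, b):
    """Elements of a that do not occur in b."""
    other = set(b)
    return [x for x in a if x not in other]


rd1 = [('a', 1), ('b', 4), ('c', 10)]
rd2 = [('a', 4), ('a', 1), ('b', '6'), ('d', 15)]

print(union([1, 1, 2, 3], [1, 1, 2, 3]))
print(intersection(rd1, rd2))
print(subtract(rd1, rd2))
```

Elements are compared as whole tuples, so ('b', 4) and ('b', '6') do not match. For a true set union in PySpark, chain distinct() after union().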

## cartesian (Cartesian product)

In [32]:
rd1 = sc.parallelize([('a', 1), ('b', 4), ('c', 10)])
rd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])
rd3 = rd1.cartesian(rd2)
rd3.collect()

[(('a', 1), ('a', 4)),
 (('a', 1), ('a', 1)),
 (('a', 1), ('b', '6')),
 (('a', 1), ('d', 15)),
 (('b', 4), ('a', 4)),
 (('b', 4), ('a', 1)),
 (('b', 4), ('b', '6')),
 (('b', 4), ('d', 15)),
 (('c', 10), ('a', 4)),
 (('c', 10), ('a', 1)),
 (('c', 10), ('b', '6')),
 (('c', 10), ('d', 15))]
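
The result has 3 × 4 = 12 pairs, one for every combination of an element from rd1 with an element from rd2. As a quick plain-Python cross-check of that count (a sketch using itertools.product rather than Spark):

```python
from itertools import product

rd1 = [('a', 1), ('b', 4), ('c', 10)]
rd2 = [('a', 4), ('a', 1), ('b', '6'), ('d', 15)]

# Every element of rd1 paired with every element of rd2.
pairs = list(product(rd1, rd2))
print(len(pairs))
print(pairs[0], pairs[-1])
```

Because the output grows as the product of the two input sizes, cartesian is best reserved for small RDDs.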