RDD API

https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds

- Creating an RDD
- map
- filter
- flatMap
- groupByKey
- groupBy
- reduceByKey
- sortByKey
- distinct
- join
- leftOuterJoin
- rightOuterJoin
- fullOuterJoin
- count
- countByKey
- foreach
- randomSplit
- union
- intersection
- subtract
- cartesian

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
      .master("local[1]") \
      .appName("SparkByExamples.com") \
      .getOrCreate()

In [None]:
sc = spark.sparkContext

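An RDD is a schema-less data structure, unlike a DataFrame. There are two ways to create one:
1. From a collection (a list or array) with .parallelize
2. From an external file with textFile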

- Create an RDD from a collection in the program (mainly useful for testing)

In [None]:
rdd1 = sc.parallelize(
    [('Ferrari', 'fast'), {'Porsche', 10000}, ['Spain', 'visited', 4504]], 4)
rdd1.collect()

In [None]:
rdd1.collect()[0]

In [None]:
rdd1.collect()[1]

In [None]:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
distData.collect()

- External file

In [None]:
rdd2 = sc.textFile('./data/VS14MORT.txt.gz', 4)

In [None]:
rdd2.take(1)

## map
Applies a function to every element of the dataset.

In [None]:
rdd1 = sc.parallelize(["b", "a", "c"])
rdd2 = rdd1.map(lambda x: (x, 1))
sorted(rdd2.collect())

## filter
Returns every element for which the function returns True.

In [None]:
rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.filter(lambda x: x % 2 == 0).collect()

## flatMap
Applies the function to every element first, then flattens the results into a single sequence.

In [None]:
r1 = sc.parallelize(["hello zeropython", "hello 168seo.cn"])
r2 = r1.flatMap(lambda x: x.split(" "))
r3 = r1.map(lambda x: x.split(" "))

print(r2.collect())
print(r3.collect())
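
The difference between the two printed results can be checked without a Spark session. The sketch below is a plain-Python model of the semantics only, not Spark's implementation; the helper names `map_model` and `flat_map_model` are made up for illustration and are not part of PySpark.

```python
from itertools import chain

def map_model(func, items):
    """One output element per input element (models rdd.map)."""
    return [func(x) for x in items]

def flat_map_model(func, items):
    """Apply func, then concatenate the resulting iterables (models rdd.flatMap)."""
    return list(chain.from_iterable(func(x) for x in items))

lines = ["hello zeropython", "hello 168seo.cn"]
nested = map_model(lambda x: x.split(" "), lines)
flat = flat_map_model(lambda x: x.split(" "), lines)

print(nested)  # [['hello', 'zeropython'], ['hello', '168seo.cn']]
print(flat)    # ['hello', 'zeropython', 'hello', '168seo.cn']
```

map keeps one list per input line, while flatMap yields the individual words.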

## groupByKey
Groups the values that share the same key.

In [None]:
from operator import add

data = ["hello zeropython", "hello 168seo.cn"]
r1 = sc.parallelize(data)
r2 = r1.flatMap(lambda x: x.split(" ")).map(lambda y: (y, 1))
print("r2", r2.collect())
r3 = r2.groupByKey()
print("r3", r3.collect())
r4 = r3.map(lambda x: {x[0]: list(x[1])})
print("r4", r4.collect())
print(r2.reduceByKey(add).collect())
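
As a rough illustration of what the cell above computes, here is a plain-Python model of grouping and reducing by key. It is a sketch of the semantics only: `group_by_key_model` and `reduce_by_key_model` are hypothetical helpers, and real Spark results come back in no guaranteed order.

```python
from collections import defaultdict
from functools import reduce
from operator import add

def group_by_key_model(pairs):
    """Collect every value under its key (models rdd.groupByKey)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_by_key_model(func, pairs):
    """Fold the values of each key with func (models rdd.reduceByKey)."""
    return {key: reduce(func, values)
            for key, values in group_by_key_model(pairs).items()}

words = "hello zeropython hello 168seo.cn".split(" ")
pairs = [(w, 1) for w in words]
grouped = group_by_key_model(pairs)
counts = reduce_by_key_model(add, pairs)

print(grouped)  # {'hello': [1, 1], 'zeropython': [1], '168seo.cn': [1]}
print(counts)   # {'hello': 2, 'zeropython': 1, '168seo.cn': 1}
```

In Spark itself, reduceByKey combines values within each partition before the shuffle, so for plain aggregation it usually moves less data than groupByKey followed by a reduction.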

## groupBy

In [None]:
data = [1, 2, 3, 5]
intRDD = sc.parallelize(data)
result = intRDD.groupBy(lambda x: x % 2).collect()
sorted([(x, sorted(y)) for (x, y) in result])

## reduceByKey
Brings the values with the same key together and combines them with the given function.

In [None]:
data_key = sc.parallelize([('a', 4), ('b', 3), ('c', 2), ('a', 8), ('d', 2),
                           ('b', 1), ('d', 3)], 4)
data_key.reduceByKey(lambda x, y: x + y).collect()

In [None]:
data = ["hello zeropython", "hello 168seo.cn"]

# print(list(data))
r1 = sc.parallelize(data)

r2 = r1.flatMap(lambda x: x.split(" ")).map(lambda x: (x, 1))

print("r2", r2.collect())
r3 = r2.reduceByKey(lambda x, y: x + y)

print("r3", r3.collect())

## sortByKey

In [None]:
sc.setLogLevel("ERROR")
data = [
    "hello zeropython", "hwlldsf world", "168seo.cn", "168seo.cn",
    "hello 168seo.cn"
]

# print(list(data))
r1 = sc.parallelize(data)

r2 = r1.flatMap(lambda x:x.split(" "))\
    .map(lambda y:(y,1))\
    .reduceByKey(lambda x,y:x+y)\
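    .sortByKey()
# sortByKey sorts the pairs by key (the word), ascending by default;
# to sort by the count instead, use sortBy(lambda kv: kv[1]).
# reduceByKey turns [("a", 1), ("a", 1), ("a", 1)] into [("a", 3)]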
print(r2.collect())
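
Sorting by key and sorting by count are easy to confuse, so the following plain-Python sketch contrasts the two orderings on the word counts this cell produces. It only models the ordering with `sorted`. In PySpark the corresponding calls would be `sortByKey()` and `sortBy(lambda kv: kv[1], ascending=False)`.

```python
# Word counts for the five input strings used above.
counts = [("hello", 2), ("zeropython", 1), ("hwlldsf", 1),
          ("world", 1), ("168seo.cn", 3)]

# Order by the key (the word itself), ascending.
by_key = sorted(counts, key=lambda kv: kv[0])

# Order by the value (the count), largest first.
by_count = sorted(counts, key=lambda kv: kv[1], reverse=True)

print([k for k, _ in by_key])
# ['168seo.cn', 'hello', 'hwlldsf', 'world', 'zeropython']
print(by_count[:2])
# [('168seo.cn', 3), ('hello', 2)]
```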

# Set operations

## distinct

In [None]:
rdd = sc.parallelize([1, 1, 2, 3])
sorted(rdd.distinct().collect())

## join

In [None]:
rd1 = sc.parallelize([('a', 1), ('b', 4), ('c', 10)])
rd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])
rd3 = rd1.join(rd2)
rd3.collect()

## leftOuterJoin

In [None]:
rd1 = sc.parallelize([('a', 1), ('b', 4), ('c', 10)])
rd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])
rd3 = rd1.leftOuterJoin(rd2)
rd3.collect()

## rightOuterJoin

In [None]:
rd1 = sc.parallelize([('a', 1), ('b', 4), ('c', 10)])
rd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])
rd3 = rd1.rightOuterJoin(rd2)
rd3.collect()

## fullOuterJoin

In [None]:
rd1 = sc.parallelize([('a', 1), ('b', 4), ('c', 10)])
rd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])
rd3 = rd1.fullOuterJoin(rd2)
rd3.collect()
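
The four join variants differ only in which keys survive and how a missing side is filled. The sketch below models that in plain Python on the same two lists of pairs. `join_model` and its `how` argument are hypothetical names for illustration, and the sorted key order is a convenience of the sketch rather than something Spark guarantees.

```python
from collections import defaultdict

def join_model(left, right, how="inner"):
    """Join two lists of (key, value) pairs; how is inner, left, right or full."""
    lvals, rvals = defaultdict(list), defaultdict(list)
    for k, v in left:
        lvals[k].append(v)
    for k, v in right:
        rvals[k].append(v)
    if how == "inner":
        keys = lvals.keys() & rvals.keys()
    elif how == "left":
        keys = lvals.keys()
    elif how == "right":
        keys = rvals.keys()
    else:  # "full"
        keys = lvals.keys() | rvals.keys()
    out = []
    for k in sorted(keys):
        # A side with no match for this key contributes a single None.
        for lv in lvals.get(k, [None]):
            for rv in rvals.get(k, [None]):
                out.append((k, (lv, rv)))
    return out

rd1 = [('a', 1), ('b', 4), ('c', 10)]
rd2 = [('a', 4), ('a', 1), ('b', '6'), ('d', 15)]

inner = join_model(rd1, rd2, "inner")
left = join_model(rd1, rd2, "left")
right = join_model(rd1, rd2, "right")
full = join_model(rd1, rd2, "full")

print(inner)  # [('a', (1, 4)), ('a', (1, 1)), ('b', (4, '6'))]
print(left)   # inner rows plus ('c', (10, None))
print(right)  # inner rows plus ('d', (None, 15))
print(len(full))  # 5
```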

## count

In [None]:
data_key.count()

## countByKey

In [None]:
data_key.countByKey()

In [None]:
data_key.countByKey().items()
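
countByKey counts how many pairs carry each key and ignores the values. A minimal plain-Python model of that result, using `collections.Counter` on the same pairs as `data_key` (a sketch of the semantics, not of how Spark computes it):

```python
from collections import Counter

pairs = [('a', 4), ('b', 3), ('c', 2), ('a', 8), ('d', 2),
         ('b', 1), ('d', 3)]

# Count occurrences of each key; the values play no role.
key_counts = Counter(key for key, _ in pairs)

print(len(pairs))                  # 7
print(sorted(key_counts.items()))  # [('a', 2), ('b', 2), ('c', 1), ('d', 2)]
```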

## foreach

In [None]:
def f(x):
    print(x)


data_key.foreach(f)

## randomSplit

In [None]:
intRDD = sc.parallelize([3, 1, 2, 5, 5])
stringRDD = sc.parallelize(['Apple', 'Orange', 'Grape', 'Banana', 'Apple'])
sRDD = intRDD.randomSplit([0.4, 0.6])
print(len(sRDD))
print(sRDD[0].collect())
print(sRDD[1].collect())

## union (set union)

In [None]:
rdd = sc.parallelize([1, 1, 2, 3])
rdd.union(rdd).collect()

## intersection (set intersection)

In [None]:
rd1 = sc.parallelize([('a', 1), ('b', 4), ('c', 10)])
rd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])
rd3 = rd1.intersection(rd2)
rd3.collect()

## subtract (set difference)

In [None]:
rd1 = sc.parallelize([('a', 1), ('b', 4), ('c', 10)])
rd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])
rd3 = rd1.subtract(rd2)
rd3.collect()

## cartesian (Cartesian product)

In [None]:
rd1 = sc.parallelize([('a', 1), ('b', 4), ('c', 10)])
rd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])
rd3 = rd1.cartesian(rd2)
rd3.collect()
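
To make the expected results of union, intersection, subtract and cartesian concrete, here is a plain-Python model of the four operations on the same inputs. The `*_model` helpers are hypothetical names, and the sketch reflects the documented behaviour as understood here: union keeps duplicates, intersection returns each common element once, and subtract keeps the elements of the first collection that do not occur in the second. Element order in real Spark output is not guaranteed.

```python
from itertools import product

def union_model(a, b):
    """Concatenation; duplicates are kept (models rdd.union)."""
    return list(a) + list(b)

def intersection_model(a, b):
    """Elements present in both, each returned once (models rdd.intersection)."""
    in_b = set(b)
    return [x for x in dict.fromkeys(a) if x in in_b]

def subtract_model(a, b):
    """Elements of a that do not occur in b (models rdd.subtract)."""
    in_b = set(b)
    return [x for x in a if x not in in_b]

def cartesian_model(a, b):
    """Every (x, y) pairing of the two inputs (models rdd.cartesian)."""
    return list(product(a, b))

rd1 = [('a', 1), ('b', 4), ('c', 10)]
rd2 = [('a', 4), ('a', 1), ('b', '6'), ('d', 15)]

print(union_model([1, 1, 2, 3], [1, 1, 2, 3]))  # [1, 1, 2, 3, 1, 1, 2, 3]
print(intersection_model(rd1, rd2))             # [('a', 1)]
print(subtract_model(rd1, rd2))                 # [('b', 4), ('c', 10)]
print(len(cartesian_model(rd1, rd2)))           # 12
```

Chaining distinct after union gives a duplicate-free result.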