# Introduction to Spark

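Spark is an open-source cluster computing system built around in-memory computation. Spark Core is the base engine for large-scale parallel and distributed data processing, and a set of powerful high-level libraries sits on top of it; these currently include **Spark SQL, Spark Streaming, MLlib, and GraphX**. Spark introduced the resilient distributed dataset (RDD) abstraction.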

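Spark SQL supports querying data with SQL or the Hive query language (HiveQL).  
Spark Streaming supports real-time processing of streaming data: it receives the data, splits it into batches, processes each batch, and builds the final output stream from the per-batch results.  
MLlib is a machine learning library.  
GraphX is a graph computation library for processing graphs and running graph-parallel computations.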

# Spark Data Operations

## Spark RDD Operations

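An RDD (resilient distributed dataset) is a fault-tolerant, parallel data structure. It lets users explicitly persist data to memory and disk, and control how the data is partitioned.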
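
As a rough illustration of "explicitly persist" and "control partitioning", the sketch below persists an RDD with a chosen storage level and then reduces its partition count. The variable names are made up for the example, and the `SparkSession` lines are only there so the snippet stands alone.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// getOrCreate reuses the notebook's existing session if there is one
val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

val nums = sc.parallelize(1 to 100, 4)       // ask for 4 partitions explicitly
nums.persist(StorageLevel.MEMORY_AND_DISK)   // keep in memory, spill to disk if it does not fit
val total = nums.sum()                       // first action computes and caches the RDD (sum returns a Double)
val fewer = nums.coalesce(2)                 // shrink to 2 partitions without a full shuffle
println(s"${nums.getNumPartitions} -> ${fewer.getNumPartitions}, level = ${nums.getStorageLevel}")
```

`persist` only marks the RDD; nothing is stored until an action such as `sum` runs.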

### Creation

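An important parameter when creating an RDD is the number of partitions (slices) the dataset is split into. Spark runs one task on the cluster for each partition. If the number is not given, Spark normally picks it automatically based on the current environment.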

In [1]:
val data = Array(1,2,3,4,5,6,7,8,9)
val distData = sc.parallelize(data, 3) // create an RDD from a local collection, split into 3 partitions
distData.collect

Intitializing Scala interpreter ...

Spark Web UI available at http://caoyuyu:4040
SparkContext available as 'sc' (version = 3.0.1, master = local[*], app id = local-1631500270884)
SparkSession available as 'spark'


data: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26
res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
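
To see how the nine elements were actually distributed, one option is `glom`, which turns each partition into an array. This is a small sketch with illustrative variable names; the `SparkSession` lines only make it self-contained.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses the notebook's existing session if there is one
val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

val distData = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
val parts = distData.glom().collect()            // one inner array per partition
parts.foreach(p => println(p.mkString(", ")))    // 1, 2, 3 / 4, 5, 6 / 7, 8, 9

// Without the second argument, parallelize falls back to sc.defaultParallelism
println(sc.defaultParallelism)
```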


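The textFile method takes a file path or URI (a local path, an hdfs:// address, and so on) and reads the file as a collection of text lines. Several files can be read in one call by passing a comma-separated list of paths.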

In [2]:
val distFile1 = sc.textFile("/D:/Dataset/stu_data/engdata.txt")
distFile1.collect

distFile1: org.apache.spark.rdd.RDD[String] = /D:/Dataset/stu_data/engdata.txt MapPartitionsRDD[2] at textFile at <console>:25
res1: Array[String] = Array(???	???, 1	Once we dreamt that we were strangers. We wake up to find that we were dear to each other., 2	We come nearest to the great when we are great in humility., 3	I love you., 4	"My heart, the bird of the wilderness, has found its sky in your eyes.", 5	It is the tears of the earth that keep her smiles in?bloom., 6	The perfect decks itself in beauty for the love of the Imperfect., 7	"What you are you do not see, what you see is your shadow.", 8	"Like the meeting of the seagulls and the waves we meet and come near.The seagulls fly off, the waves roll away and we depart.")


In [3]:
val distFile2 = sc.textFile("hdfs://caoyuyu:8020/input/hadoop.txt")
distFile2.collect

distFile2: org.apache.spark.rdd.RDD[String] = hdfs://caoyuyu:8020/input/hadoop.txt MapPartitionsRDD[4] at textFile at <console>:25
res2: Array[String] = Array(Hadoop???????, "", hadoop  dfsadmin -safemode leave  #???????, hadoop jar /opt/hadoop-2.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar  pi 10 10 #????pi?)
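
The comma-separated form is not shown in the cells above, so here is a minimal sketch of it. It writes two small temporary files (the file names and contents are invented for the example) and reads both in a single `textFile` call.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// getOrCreate reuses the notebook's existing session if there is one
val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

// Hypothetical input files, created only for this demonstration
val dir = Files.createTempDirectory("textfile-demo")
val fileA = dir.resolve("a.txt")
val fileB = dir.resolve("b.txt")
Files.write(fileA, "line a1\nline a2\n".getBytes(StandardCharsets.UTF_8))
Files.write(fileB, "line b1\n".getBytes(StandardCharsets.UTF_8))

// Two file: URIs joined by a comma are read as one RDD of lines
val lines = sc.textFile(s"${fileA.toUri},${fileB.toUri}")
val sortedLines = lines.collect().sorted
println(sortedLines.mkString(" | "))
```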


### Transformations

map applies a given function to every element of an RDD to produce a new RDD; the elements of the input and output RDDs correspond one to one.

In [4]:
val rdd1 = sc.parallelize(1 to 9, 3)
val rdd2 = rdd1.map(x => x*2) // map: double every element
rdd2.collect

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:25
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at map at <console>:26
res3: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)


In [5]:
val rdd3 = rdd2.filter(x => x > 10) // filter: keep only elements greater than 10
rdd3.collect

rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at filter at <console>:26
res4: Array[Int] = Array(12, 14, 16, 18)


In [6]:
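val rdd4 = rdd3.flatMap(x => x to 20) // flatMap: each element maps to a sequence rather than a single element (one-to-many), then the sequences are flattened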
rdd4.collect

rdd4: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[8] at flatMap at <console>:26
res5: Array[Int] = Array(12, 13, 14, 15, 16, 17, 18, 19, 20, 14, 15, 16, 17, 18, 19, 20, 16, 17, 18, 19, 20, 18, 19, 20)


In [7]:
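// Pairs each element with its successor, within a single partition
def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
    if (!iter.hasNext) return Iterator.empty // guard: an empty partition has no pairs
    var res = List[(T, T)]()
    var pre = iter.next
    while (iter.hasNext){
        val cur = iter.next
        res ::= ((pre, cur)) // prepend, so the pairs of a partition come out in reverse order
        pre = cur
    }
    res.iterator
}
val rdd5 = rdd1.mapPartitions(myfunc)
rdd5.collect
// The final RDD is the concatenation of the results of applying the function to each partition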

myfunc: [T](iter: Iterator[T])Iterator[(T, T)]
rdd5: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[9] at mapPartitions at <console>:36
res6: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
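
The output above has no (3,4) or (6,7) pair because the function only sees one partition at a time. One way to check which partition each element lives in is `mapPartitionsWithIndex`; the sketch below tags every element with its partition index (variable names are illustrative).

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses the notebook's existing session if there is one
val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

val rdd1 = sc.parallelize(1 to 9, 3)
// The function receives the partition index together with the partition's iterator
val tagged = rdd1.mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x))).collect()
println(tagged.mkString(", "))
// (0,1), (0,2), (0,3), (1,4), (1,5), (1,6), (2,7), (2,8), (2,9)
```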


In [8]:
val a = sc.parallelize(1 to 10000, 3)
a.sample(false, 0.1, 0).count // sample draws a random sample: 1st argument = sample with replacement?, 2nd = sampling fraction, 3rd = random seed

a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:25
res7: Long = 943
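
Note that the fraction is an expected proportion, not an exact size: 943 elements came back rather than 1000. The seed makes the draw repeatable, which the following sketch checks by sampling twice with the same arguments (variable names are illustrative).

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses the notebook's existing session if there is one
val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

val a = sc.parallelize(1 to 10000, 3)
val s1 = a.sample(withReplacement = false, fraction = 0.1, seed = 0).collect()
val s2 = a.sample(withReplacement = false, fraction = 0.1, seed = 0).collect()

println(s1.length)               // close to 1000, but not exactly 1000
println(s1.sameElements(s2))     // same seed and same partitioning give the same sample
```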


In [9]:
val rdd8 = rdd1.union(rdd3)
rdd8.collect // union: concatenates the two RDDs (duplicates are kept)

rdd8: org.apache.spark.rdd.RDD[Int] = UnionRDD[12] at union at <console>:28
res8: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 14, 16, 18)


In [10]:
val rdd9 = rdd8.intersection(rdd1)
rdd9.collect // intersection: elements present in both RDDs

rdd9: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[18] at intersection at <console>:28
res9: Array[Int] = Array(6, 1, 7, 8, 2, 3, 9, 4, 5)


In [11]:
val rdd10 = rdd8.union(rdd9).distinct
rdd10.collect // distinct: remove duplicates

rdd10: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[22] at distinct at <console>:28
res10: Array[Int] = Array(12, 1, 14, 2, 3, 4, 16, 5, 6, 18, 7, 8, 9)


In [12]:
val rdd0 = sc.parallelize(Array((1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)), 3)
val rdd11 = rdd0.groupByKey() // groupByKey: group the values of each key
rdd11.collect

rdd0: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[23] at parallelize at <console>:25
rdd11: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[24] at groupByKey at <console>:26
res11: Array[(Int, Iterable[Int])] = Array((1,CompactBuffer(1, 2, 3)), (2,CompactBuffer(1, 2, 3)))


In [15]:
val rdd12 = rdd0.reduceByKey((x, y) => x+y)  // reduceByKey: aggregate the values of each key
rdd12.collect

rdd12: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[26] at reduceByKey at <console>:28
res14: Array[(Int, Int)] = Array((1,6), (2,6))
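
The two operations are related: summing the groups produced by `groupByKey` gives the same result as `reduceByKey`. The sketch below compares the two routes on the same data. `reduceByKey` is usually the better choice for aggregation because it combines values within each partition before shuffling.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses the notebook's existing session if there is one
val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

val pairs = sc.parallelize(Seq((1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)), 3)

// Route 1: collect all values per key, then sum each group
val viaGroup = pairs.groupByKey().mapValues(_.sum).collect().sorted
// Route 2: fold values per key directly
val viaReduce = pairs.reduceByKey(_ + _).collect().sorted

println(viaGroup.mkString(", "))    // (1,6), (2,6)
println(viaReduce.mkString(", "))   // (1,6), (2,6)
```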


In [16]:
val rdd14 = rdd0.sortByKey()
rdd14.collect // sortByKey: ascending by default

rdd14: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[29] at sortByKey at <console>:26
res15: Array[(Int, Int)] = Array((1,1), (1,2), (1,3), (2,1), (2,2), (2,3))
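
Because rdd0 is already ordered by key, the effect of sorting is hard to see above. A small sketch with unordered keys (the data is invented for the example), sorted in descending order via the `ascending` parameter:

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses the notebook's existing session if there is one
val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

val pairs = sc.parallelize(Seq((2, "b"), (3, "c"), (1, "a")), 3)
val desc = pairs.sortByKey(ascending = false).collect()
println(desc.mkString(", "))   // (3,c), (2,b), (1,a)
```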


In [17]:
val rdd15 = rdd0.join(rdd0)
rdd15.collect

rdd15: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = MapPartitionsRDD[32] at join at <console>:26
res16: Array[(Int, (Int, Int))] = Array((1,(1,1)), (1,(1,2)), (1,(1,3)), (1,(2,1)), (1,(2,2)), (1,(2,3)), (1,(3,1)), (1,(3,2)), (1,(3,3)), (2,(1,1)), (2,(1,2)), (2,(1,3)), (2,(2,1)), (2,(2,2)), (2,(2,3)), (2,(3,1)), (2,(3,2)), (2,(3,3)))
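
Joining rdd0 with itself pairs every value of a key with every other value of that key, which is why each key yields 3 × 3 = 9 rows. The sketch below uses two different, made-up datasets so the matching rule is easier to follow: `join` keeps only keys present on both sides, while `leftOuterJoin` keeps every key from the left side.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses the notebook's existing session if there is one
val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

// Hypothetical data: (id, name) and (id, score)
val names = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
val scores = sc.parallelize(Seq((1, 90), (1, 85), (2, 70), (4, 60)))

val joined = names.join(scores).collect().sorted
println(joined.mkString(", "))   // (1,(alice,85)), (1,(alice,90)), (2,(bob,70))

// Key 3 has no score, so it appears with None instead of being dropped
val left = names.leftOuterJoin(scores).collect()
println(left.length)             // 4
```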


In [6]:
val rdd1 = sc.parallelize(1 to 9, 3)
val rdd2 = rdd1.randomSplit(Array(0.3, 0.7), 1) // randomSplit: split by weights; 1st argument = weights, 2nd = random seed
rdd2(0).collect

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:28
rdd2: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[8] at randomSplit at <console>:29, MapPartitionsRDD[9] at randomSplit at <console>:29)
res5: Array[Int] = Array(3, 5, 7)


In [7]:
rdd2(1).collect

res6: Array[Int] = Array(1, 2, 4, 6, 8, 9)


In [9]:
val rdd1 = sc.parallelize(1 to 9, 3)
val rdd2 = sc.parallelize(1 to 3, 3)
val rdd3 = rdd1.subtract(rdd2) // subtract: elements of rdd1 that do not appear in rdd2
rdd3.collect

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at parallelize at <console>:29
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:30
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[21] at subtract at <console>:31
res8: Array[Int] = Array(6, 9, 4, 7, 5, 8)


In [12]:
val rdd1 = sc.parallelize(1 to 4, 3)
val rdd2 = sc.parallelize(Array("a", "b", "c", "d"), 3)
val rdd3 = rdd1.zip(rdd2) // zip: pair up the elements of the two RDDs by position
rdd3.collect

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[28] at parallelize at <console>:29
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[29] at parallelize at <console>:30
rdd3: org.apache.spark.rdd.RDD[(Int, String)] = ZippedPartitionsRDD2[30] at zip at <console>:31
res10: Array[(Int, String)] = Array((1,a), (2,b), (3,c), (4,d))
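
`zip` works here because both RDDs have the same number of partitions and the same number of elements in each partition. The sketch below shows what happens when the partition counts differ, and uses `zipWithIndex` as an alternative when all that is needed is a position for each element. Variable names are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// getOrCreate reuses the notebook's existing session if there is one
val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

val nums = sc.parallelize(1 to 4, 3)
val letters = sc.parallelize(Seq("a", "b", "c", "d"), 2)   // 2 partitions, not 3

// Mismatched partition counts make the zip fail when it is evaluated
val zipFails = Try(nums.zip(letters).collect()).isFailure
println(zipFails)                  // true

// zipWithIndex pairs each element with its position and needs no second RDD
val indexed = letters.zipWithIndex().collect()
println(indexed.mkString(", "))    // (a,0), (b,1), (c,2), (d,3)
```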


### Actions

In [13]:
val rdd1 = sc.parallelize(1 to 9, 3)
val rdd2 = rdd1.reduce(_ + _) // reduce: aggregate all elements with the given function (the result is a plain Int, not an RDD)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:27
rdd2: Int = 45


In [14]:
rdd1.collect() // return all elements of the dataset as an array

res11: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)


In [16]:
rdd1.count() // return the number of elements in the dataset

res13: Long = 9


In [17]:
rdd1.first() // return the first element of the dataset

res14: Int = 1


In [18]:
rdd1.take(3) // return an array with the first n elements of the dataset

res15: Array[Int] = Array(1, 2, 3)


In [20]:
rdd1.takeOrdered(4) // return the n smallest elements in ascending order (or according to a custom Ordering); the result is not random

res17: Array[Int] = Array(1, 2, 3, 4)
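
On an already sorted RDD, `take` and `takeOrdered` return the same thing, so the difference is easier to see on shuffled input. The sketch below, with made-up data, compares `take`, `takeOrdered` with the default and a reversed ordering, and `top`.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses the notebook's existing session if there is one
val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

val shuffled = sc.parallelize(Seq(5, 3, 9, 1, 7, 2, 8, 4, 6), 3)

val firstFour = shuffled.take(4)                               // first 4 in stored order: 5, 3, 9, 1
val smallest  = shuffled.takeOrdered(4)                        // 4 smallest, ascending: 1, 2, 3, 4
val largest   = shuffled.takeOrdered(4)(Ordering[Int].reverse) // 4 largest, descending: 9, 8, 7, 6
val top       = shuffled.top(4)                                // same as the reversed takeOrdered: 9, 8, 7, 6

println(Seq(firstFour, smallest, largest, top).map(_.mkString(", ")).mkString(" | "))
```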


In [None]:
// foreach(func) runs func on every element of the dataset
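
The cell above is empty, so here is a minimal sketch of `foreach`. Since the function runs on the executors and returns nothing, the example uses an accumulator to bring a result back to the driver; the accumulator name "sum" is an arbitrary label.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses the notebook's existing session if there is one
val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

val rdd1 = sc.parallelize(1 to 9, 3)
val acc = sc.longAccumulator("sum")   // driver-side counter that tasks can add to
rdd1.foreach(x => acc.add(x))         // runs once per element, purely for its side effect
println(acc.value)                    // 45
```

Printing inside `foreach` also works in local mode, but on a cluster the output goes to the executors' logs rather than the notebook.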