findspark: a helper library that makes Spark usable from Python scripts by locating the Spark installation and adding `pyspark` to `sys.path`.

SparkContext is the entry point to all Spark functionality. When a Spark application runs, a driver program starts; it contains the main function, and the SparkContext is created there. The driver program then runs the operations inside executors on the worker nodes.

SparkContext uses Py4J to launch a JVM and create a JavaSparkContext [[Tutorials Point](https://www.tutorialspoint.com/pyspark/pyspark_sparkcontext.htm)].

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark import SparkContext
sc = SparkContext(master="local", appName="firstapp")
# Parameters
# master: URL of the cluster to connect to. "local" runs Spark on this machine with a single worker thread.
# appName: name of the application.

# Running this cell a second time raises the following error:
# ValueError: Cannot run multiple SparkContexts at once
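
Re-running the cell that creates the SparkContext raises `ValueError: Cannot run multiple SparkContexts at once`. One way around this is `SparkContext.getOrCreate`, which returns the running context if there is one. The sketch below wraps it in a helper; the name `get_or_create_context` and its default arguments are assumptions made for illustration, and the imports are deferred into the function body so that defining it does not require Spark.

```python
def get_or_create_context(master="local", app_name="firstapp"):
    """Return the active SparkContext, creating one only if none exists.

    Hypothetical helper, not part of the notebook itself.
    """
    import findspark
    findspark.init()  # make the local Spark installation importable
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(master).setAppName(app_name)
    # getOrCreate() reuses a running context instead of raising ValueError.
    return SparkContext.getOrCreate(conf)
```

Calling `sc = get_or_create_context()` can then be repeated safely. If a context is already running, its existing settings are kept, so call `sc.stop()` first when a different `master` or application name is needed.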

Count the lines of the README.md file in the current directory that contain the character `a`, and the lines that contain the character `b`.

In [3]:
file = sc.textFile("README.md").cache()
rowA = file.filter(lambda row: 'a' in row).count()
rowB = file.filter(lambda row: 'b' in row).count()
print(f"Lines with a: {rowA}, lines with b: {rowB}")

Lines with a: 53, lines with b: 26


## PySpark RDD

RDD stands for Resilient Distributed Dataset. An RDD is a collection of elements partitioned across multiple nodes so that it can be processed in parallel on a cluster.

RDDs are immutable: once you create an RDD, you cannot change it. RDDs are also fault tolerant, so they recover automatically after a failure.

Two kinds of operations can be applied to an RDD:
- Transformation
- Action

Transformation: an operation applied to an RDD to create a new RDD. `filter`, `groupBy`, and `map` are examples of transformations.

Action: an operation applied to an RDD that instructs Spark to perform the computation and send the result back to the driver.
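
Transformations in Spark are lazy: they only describe a new RDD, and nothing is computed until an action asks for a result. The following is a plain-Python analogy that uses a generator expression, not the Spark API. The names `traced_upper`, `sample_words`, and `log` are made up for this illustration.

```python
log = []  # records when each element is actually processed

def traced_upper(word):
    log.append(word)
    return word.upper()

sample_words = ["scala", "java", "spark"]

# "Transformation": builds a lazy pipeline, nothing is computed yet.
pipeline = (traced_upper(w) for w in sample_words if "a" in w)
calls_before_action = len(log)

# "Action": consuming the pipeline forces the computation.
result = list(pipeline)
calls_after_action = len(log)

print(calls_before_action, calls_after_action, result)
```

The function is not called while the pipeline is being defined. It runs only once the result is requested, which is roughly how `filter` and `map` behave until `count()` or `collect()` is called on the RDD.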

To apply any operation in PySpark, we need to create a PySpark RDD first. 

In [6]:
arr = [
    "scala", 
    "java", 
    "hadoop", 
    "spark", 
    "akka",
    "spark vs hadoop", 
    "pyspark",
    "pyspark and spark"
]
words = sc.parallelize(arr)  # creates an RDD object

In [8]:
# count(): returns the number of elements in the RDD.
words.count()

8

In [10]:
# collect(): returns all elements of the RDD as a list.
words.collect()

['scala',
 'java',
 'hadoop',
 'spark',
 'akka',
 'spark vs hadoop',
 'pyspark',
 'pyspark and spark']

In [21]:
# foreach(): applies the given function to every element of the RDD and returns nothing.
# The function runs on the executors, so the printed output may not show up in the notebook.
def f(x):
    print(x)
words.foreach(f)

In [23]:
# filter(): creates a new RDD from the elements that satisfy the condition.
filtered = words.filter(lambda x: 'spark' in x)
filtered.collect()

['spark', 'spark vs hadoop', 'pyspark', 'pyspark and spark']

In [30]:
# map(): applies the same operation to every element of the RDD and returns a new RDD.
words_map = words.map(lambda x: (x, 1))
words_map.collect()

#for i in words_map.collect():
#    print(i[0])
# this works

[('scala', 1),
 ('java', 1),
 ('hadoop', 1),
 ('spark', 1),
 ('akka', 1),
 ('spark vs hadoop', 1),
 ('pyspark', 1),
 ('pyspark and spark', 1)]

In [49]:
# reduce(): combines the elements of the RDD with the given two-argument function and returns a single value.
from operator import add
nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.reduce(add))
print(words.reduce(add))

#def myAdd(x):
#    filtered_arr = list(filter(lambda i: "spark" in i, x.collect()))
#    return ''.join(filtered_arr)

#words.reduce(myAdd)

15
scalajavahadoopsparkakkaspark vs hadooppysparkpyspark and spark
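
Spark's `reduce` expects a commutative and associative function, because each partition is reduced on its own and the partial results are then combined. The sketch below imitates that strategy in plain Python with `functools.reduce`. The helper `chunked_reduce` and the chunk size are assumptions made for illustration, not Spark internals.

```python
from functools import reduce
from operator import add, sub

def chunked_reduce(func, items, chunk_size):
    """Reduce each chunk separately, then reduce the partial results."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    partials = [reduce(func, chunk) for chunk in chunks]
    return reduce(func, partials)

values = [1, 2, 3, 4, 5]

# Addition is associative, so chunking does not change the result.
sequential_sum = reduce(add, values)
chunked_sum = chunked_reduce(add, values, 3)

# Subtraction is not associative, so the two strategies disagree.
sequential_diff = reduce(sub, values)
chunked_diff = chunked_reduce(sub, values, 3)

print(sequential_sum, chunked_sum, sequential_diff, chunked_diff)
```

With addition both approaches give 15. With subtraction the sequential result is -13 and the chunked result is -3, which is why a function like `sub` is not a safe argument for `reduce` on an RDD.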


In [55]:
# join(): for two RDDs of (key, value) pairs, returns (key, (value1, value2)) for every key present in both.
x = sc.parallelize([("spark", 1), ("hadoop", 4), ("dene", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5), ("naber", 4)])
joined = x.join(y)
joined.collect()

[('hadoop', (4, 5)), ('spark', (1, 2))]
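
The output contains only `spark` and `hadoop` because `join` is an inner join: keys that appear in just one of the two RDDs (`dene`, `naber`) are dropped. The plain-Python sketch below reproduces that behaviour on ordinary lists. The function `inner_join` is a hypothetical stand-in, not the Spark implementation.

```python
from collections import defaultdict

def inner_join(left, right):
    """Pair up values from two lists of (key, value) tuples by key."""
    right_values = defaultdict(list)
    for key, value in right:
        right_values[key].append(value)
    # Keys missing on either side produce no output rows.
    return [(key, (lv, rv)) for key, lv in left for rv in right_values.get(key, [])]

left_pairs = [("spark", 1), ("hadoop", 4), ("dene", 4)]
right_pairs = [("spark", 2), ("hadoop", 5), ("naber", 4)]
joined_pairs = inner_join(left_pairs, right_pairs)
print(sorted(joined_pairs))
```

Sorting is used here only to make the printed order stable. The order of the elements returned by `collect()` on a joined RDD is not guaranteed to match the input order.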