# History of Spark

## grep

## Hadoop
- MapReduce (2004)
- Hadoop @Yahoo (2006)

<img src="https://upload.wikimedia.org/wikipedia/commons/0/0e/Hadoop_logo.svg" width="60%"/>

https://phoenixnap.com/kb/hadoop-vs-spark

## [Spark](http://spark.apache.org/)
- Spark @UC Berkeley (2009)
- Open Source Spark, BSD License (2010)
- Developed by AMPLab (2011)
- Databricks maintains open source Spark (2013) => Stability
- License change to Apache License 2.0 (2013)
- [Large Scale Sorting record](https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html) (2014)
- Spark 2.x (2016)
- Spark 3.x (2020)

## Languages

### [Scala](https://www.scala-lang.org/)
- Designed by Martin Odersky (2004)
- Developed by EPFL (Switzerland)

### [Java](https://docs.oracle.com/javase/8/docs/technotes/guides/language/index.html)
- Designed by James Gosling (1995)
- Developed by Sun Microsystems, now Oracle

### [Python](https://www.python.org/)
- Designed by Guido van Rossum (1991)
- Developed by Python Software Foundation

### [R](https://www.r-project.org/)
- Designed by Ross Ihaka and Robert Gentleman (1993)
- Developed by R Core Team

### [C#](https://docs.microsoft.com/en-us/dotnet/csharp/)
- Designed by Anders Hejlsberg (2000)
- Developed by Microsoft

### [F#](https://docs.microsoft.com/en-us/dotnet/fsharp/)
- Designed by Don Syme (2005)
- Developed by Microsoft and The F# Software Foundation

### SQL
- Designed by Donald D. Chamberlin and Raymond F. Boyce (1974)
- Developed by ISO/IEC

## Libraries

### SQL

### Streaming

### MLlib

### GraphX

# Data

In [1]:
from google.colab import drive
import os
from requests import get

In [2]:
%%html
<iframe src="https://corpus.canterbury.ac.nz/descriptions/" width="800" height="600"></iframe>

In [3]:
drive.mount('/content/gdrive', force_remount=True)
# base_dir instead of dir, so the built-in dir() is not shadowed
base_dir = os.path.join('gdrive', 'My Drive', 'Eurostat', '05 - Data Science for Big Data')
data_dir = os.path.join(base_dir, 'data')

Mounted at /content/gdrive


In [4]:
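def download_save(url, filename):
  """Download `url` and save the response body as `filename` inside `data_dir`."""
  res = get(url)
  if res.status_code != 200:
    print(f"Couldn't fetch data from {url}")
  else:
    # The context manager closes the file even if the write fails
    with open(os.path.join(data_dir, filename), 'wb') as out_file:
      out_file.write(res.content)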

In [5]:
download_save('http://corpus.canterbury.ac.nz/resources/large.zip',
              'large.zip')

In [6]:
!cd "gdrive/MyDrive/Eurostat/05 - Data Science for Big Data/data" && unzip large.zip

Archive:  large.zip
  inflating: bible.txt               
  inflating: E.coli                  
  inflating: world192.txt            
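
The same extraction can be done from Python with the standard-library `zipfile` module instead of the shell. The sketch below is a minimal illustration under assumptions: `extract_zip` is a hypothetical helper that is not part of the notebook, and the demo runs against a throwaway archive in a temporary directory rather than `large.zip`.

```python
import os
import tempfile
import zipfile


def extract_zip(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Demo on a throwaway archive so the sketch runs without Google Drive.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, 'demo.zip')
    with zipfile.ZipFile(demo_zip, 'w') as archive:
        archive.writestr('hello.txt', 'hello spark')
    extracted = extract_zip(demo_zip, tmp)
    extracted_exists = os.path.exists(os.path.join(tmp, 'hello.txt'))

print(extracted, extracted_exists)
```

In the notebook the call would presumably look like `extract_zip(os.path.join(data_dir, 'large.zip'), data_dir)`.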


# Hello Spark

## PySpark Installation

In [7]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.2
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.0-py2.py3-none-any.whl size=281805912 sha256=6d633748519264d78b6626ea01116b3d6bb67444867e95fa1bbd6ea8e399b8da
  Stored in directory: /root/.cache/pip/wheels/0b/de/d2/9be5d59d7331c6c2a7c1b6d1a4f463ce107332b1ecd4e80718
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.2 pyspark-3.2.0


## Spark Context

In [8]:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Hello-Spark").setMaster("local[*]")
# getOrCreate reuses a running context, so the cell can be re-run without an error
sc = SparkContext.getOrCreate(conf=conf)

In [9]:
sc

In [10]:
file_name = os.path.join(data_dir, 'bible.txt')
bible_file = sc.textFile(file_name)

In [11]:
type(bible_file)

pyspark.rdd.RDD

In [12]:
print(bible_file.__doc__)


    A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
    Represents an immutable, partitioned collection of elements that can be
    operated on in parallel.
    


In [13]:
bible_file.first()

'In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters. '
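
The RDD docstring above describes a "partitioned collection of elements that can be operated on in parallel". As a rough plain-Python analogy (not how Spark schedules work internally), the sketch below splits a list into chunks and processes each chunk independently. `make_partitions` and the sample lines are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def make_partitions(items, num_partitions):
    """Split items into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


lines = ['a b', 'c d e', 'f', 'g h', 'i']
partitions = make_partitions(lines, 2)

# Each partition is processed independently, here by counting its words.
with ThreadPoolExecutor(max_workers=2) as pool:
    words_per_partition = list(pool.map(
        lambda part: sum(len(line.split()) for line in part), partitions))

# The per-partition results are combined at the end.
total_words = sum(words_per_partition)
print(partitions, words_per_partition, total_words)
```

The point of the analogy is only that no chunk needs to see another chunk to do its share of the work, which is what makes the collection safe to process in parallel.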

# Word Count

## Map & Reduce


In [14]:
# flatMap: split every line into words and flatten the per-line lists into one RDD
bible_words = bible_file.flatMap(lambda line: line.split(" "))

In [15]:
# map: pair each word with an initial count of 1
word_counts = bible_words.map(lambda word: (word, 1))

In [16]:
# reduceByKey: sum the counts of identical words
word_counts = word_counts.reduceByKey(lambda cumulVal, newVal: cumulVal + newVal)

In [17]:
type(bible_words)

pyspark.rdd.PipelinedRDD
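
To make the three steps concrete, here is a minimal plain-Python emulation of `flatMap`, `map` and `reduceByKey` on two made-up lines. It is a sketch of what each step computes, not of how Spark distributes the work, and the sample sentences are assumptions chosen for the example.

```python
from functools import reduce
from itertools import chain, groupby
from operator import itemgetter

lines = ['the earth and the heaven', 'and the light']

# flatMap: one list of words per line, flattened into a single sequence
words = list(chain.from_iterable(line.split(' ') for line in lines))

# map: pair every word with a count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: group the pairs by word, then fold the counts of each group
pairs.sort(key=itemgetter(0))
counts = {
    word: reduce(lambda cumul_val, new_val: cumul_val + new_val,
                 (count for _, count in group))
    for word, group in groupby(pairs, key=itemgetter(0))
}
print(counts)
```

The lambda passed to `reduce` has the same shape as the one given to `reduceByKey`: it receives the running total and the next value for a single key.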

## Lazy Evaluation

In [18]:
word_counts.take(4)

[('God', 2186), ('created', 36), ('And', 12163), ('earth', 326)]
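
Transformations such as `flatMap`, `map` and `reduceByKey` only describe work. Nothing is computed until an action such as `take` asks for a result. Python generators behave in a loosely similar way, so the sketch below uses them as an analogy. It does not reproduce Spark's execution model, and `split_line` and the sample lines are hypothetical.

```python
from itertools import islice

evaluated = []  # records which lines were actually processed


def split_line(line):
    evaluated.append(line)
    return line.split(' ')


lines = ['a b', 'c d', 'e f', 'g h']

# "Transformation": the generator expression describes the work but runs nothing yet.
words = (word for line in lines for word in split_line(line))
evaluated_before_action = list(evaluated)

# "Action": pulling three words forces only as many lines as needed.
first_three = list(islice(words, 3))
print(evaluated_before_action, first_three, evaluated)
```

Before the "action" no line has been split. Afterwards only the first two lines have been touched.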

In [19]:
# Sort the (word, count) pairs by count, highest first
word_counts = word_counts.sortBy(lambda kv_pair: kv_pair[1], ascending=False)

In [20]:
word_counts.take(4)

[('the', 59835), ('and', 37322), ('of', 32972), ('', 30383)]
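
The fourth entry above, `('', 30383)`, is an empty string counted as a word. A likely cause is `split(" ")`, which yields an empty string for every consecutive or trailing space (the first line of the file, shown earlier, ends with a space). The sketch below compares it with `split()` on a made-up line. Switching the notebook to `split()` would change the counts shown here, so this is only an illustration.

```python
line = 'And God said,  Let there be light: '

tokens_single_space = line.split(' ')  # separator is exactly one space
tokens_whitespace = line.split()       # separator is any run of whitespace

print(tokens_single_space)
print(tokens_whitespace)
```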

In [21]:
word_counts.saveAsTextFile(os.path.join(data_dir, 'bible_word_count'))

## CountByValue

In [22]:
bible_file.flatMap(lambda line : line.split(" ")).countByValue()

defaultdict(int,
            {'In': 309,
             'the': 59835,
             'beginning': 68,
             'God': 2186,
             'created': 36,
             'heaven': 202,
             'and': 37322,
             'earth.': 191,
             'And': 12163,
             'earth': 326,
             'was': 4249,
             'without': 372,
             'form,': 3,
             'void;': 3,
             'darkness': 61,
             'upon': 2659,
             'face': 265,
             'of': 32972,
             'deep.': 5,
             'Spirit': 118,
             'moved': 45,
             'waters.': 25,
             '': 30383,
             'said,': 1556,
             'Let': 417,
             'there': 1799,
             'be': 6643,
             'light:': 10,
             'light.': 28,
             'saw': 508,
             'light,': 53,
             'that': 12107,
             'it': 4336,
             'good:': 33,
             'divided': 54,
             'light': 160,
             'from': 
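
`countByValue()` is an action: it returns the counts to the driver as a Python dictionary rather than producing another RDD, so it suits results small enough to fit in driver memory. On a local list, `collections.Counter` computes the same kind of mapping. The sketch below is a plain-Python analogy using the opening words of the file as sample input.

```python
from collections import Counter

words = 'In the beginning God created the heaven and the earth.'.split(' ')
local_counts = Counter(words)

print(local_counts['the'], local_counts['God'])
print(local_counts.most_common(1))
```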

In [23]:
sc.stop()