<a href="https://colab.research.google.com/github/fralfaro/tutorial_pyspark/blob/main/notebooks/learning-apache-spark/notebooks/01-data-strcture/1.1-rdd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=f35048c11054c3cf7e4a0831384aaaf4d1893271bcc14eddd2c2ba5af2a7e63a
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


In [19]:
# files
!wget https://raw.githubusercontent.com/fralfaro/tutorial_pyspark/main/notebooks/learning-apache-spark/data/twitter.txt
!wget https://raw.githubusercontent.com/fralfaro/tutorial_pyspark/main/notebooks/learning-apache-spark/data/mtcars.csv

--2022-04-19 19:46:51--  https://raw.githubusercontent.com/fralfaro/tutorial_pyspark/main/notebooks/learning-apache-spark/data/twitter.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1097 (1.1K) [text/plain]
Saving to: ‘twitter.txt’


2022-04-19 19:46:51 (45.5 MB/s) - ‘twitter.txt’ saved [1097/1097]

--2022-04-19 19:46:51--  https://raw.githubusercontent.com/fralfaro/tutorial_pyspark/main/notebooks/learning-apache-spark/data/mtcars.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1694 (1.7K) [text/plain]
Saving to: ‘mtcars.

In [2]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# Create the SparkContext (the entry point for RDDs) and wrap it in a SparkSession
sc = SparkContext()
spark = SparkSession(sparkContext=sc)

# RDD object

The `pyspark.SparkContext` class creates a client that connects to a Spark cluster, and this client can be used to create RDD objects. Two of its methods create an RDD directly:
* `parallelize()`
* `textFile()`

## `parallelize()`

`parallelize()` distributes a local **Python collection** to form an RDD. Common built-in Python collections include `dict`, `list`, `tuple` and `set`.

Examples:

In [4]:
# from a list
rdd = sc.parallelize([1,2,3])
rdd.collect()

[1, 2, 3]

In [5]:
# from a tuple
rdd = sc.parallelize(('cat', 'dog', 'fish'))
rdd.collect()

['cat', 'dog', 'fish']

In [6]:
# from a list of tuples
list_t = [('cat', 'dog', 'fish'), ('orange', 'apple')]
rdd = sc.parallelize(list_t)
rdd.collect()

[('cat', 'dog', 'fish'), ('orange', 'apple')]

In [7]:
# from a set
s = {'cat', 'dog', 'fish', 'cat', 'dog', 'dog'}
rdd = sc.parallelize(s)
rdd.collect()

['fish', 'cat', 'dog']
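The output above shows two properties of sets: duplicates are already gone before Spark sees the data, and the element order is arbitrary. The following is a minimal sketch, not part of the original notebook, of sorting the set into a list first when a predictable order matters. `to_sorted_rdd` is a hypothetical helper name, and `sc` is assumed to be the `SparkContext` created earlier.

```python
s = {'cat', 'dog', 'fish', 'cat', 'dog', 'dog'}

# Duplicates are dropped when the set is built, before Spark is involved.
unique_count = len(s)

# Sets are unordered; sorting produces a list with a predictable order.
ordered = sorted(s)


def to_sorted_rdd(sc, items):
    """Hypothetical helper: parallelize a collection in sorted order.

    `sc` is assumed to be an active SparkContext.
    """
    return sc.parallelize(sorted(items))
```

With an active context, `to_sorted_rdd(sc, s).collect()` would be expected to return the elements in sorted order.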

When the input is a `dict`, only its keys are used to form the RDD.

In [8]:
# from a dict
d = {
    'a': 100,
    'b': 200,
    'c': 300
}
rdd = sc.parallelize(d)
rdd.collect()

['a', 'b', 'c']
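If the values are needed as well, one option is to pass the dictionary's items rather than the dictionary itself. The following is a minimal sketch, not part of the original notebook. `to_pair_rdd` is a hypothetical helper name, and `sc` is assumed to be the `SparkContext` created earlier.

```python
d = {'a': 100, 'b': 200, 'c': 300}

# Iterating over a dict yields only its keys, which is why
# sc.parallelize(d) produces an RDD of keys.
keys = list(d)

# d.items() yields (key, value) tuples instead.
pairs = list(d.items())


def to_pair_rdd(sc, mapping):
    """Hypothetical helper: build an RDD of (key, value) tuples from a dict.

    `sc` is assumed to be an active SparkContext.
    """
    return sc.parallelize(list(mapping.items()))
```

With an active context, `to_pair_rdd(sc, d).collect()` would be expected to return the `(key, value)` tuples held in `pairs`.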

## `textFile()`

The `textFile()` function reads a text file and returns it as an **RDD of strings**. Usually, you will need to apply one or more **map** functions to transform each element of the RDD into a data structure or type that is suitable for data analysis.

**When using `textFile()`, each line of the text file becomes an element in the resulting RDD.**
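As a rough sketch of such a map step, the function below turns one tab-separated line, shaped like the lines in `twitter.txt`, into a tuple. It is not part of the original notebook. The name `parse_tweet_line` and the field names `text`, `tweet_id` and `score` are assumptions made for illustration, since the notebook does not say what the two numeric columns mean.

```python
def parse_tweet_line(line):
    """Hypothetical parser: split one tab-separated line into a typed tuple.

    The field names (text, tweet_id, score) are assumptions.
    """
    # Split from the right so that any tab inside the text stays in the text.
    text, tweet_id, score = line.rsplit('\t', 2)
    return text, int(tweet_id), float(score)


sample = ('Mitsubishi i MiEV - Wikipedia, the free encyclopedia - '
          'http://goo.gl/xipe Cutest car ever!\t109017669432377344\t1.0')
record = parse_tweet_line(sample)
```

Given an RDD of such lines, for example from `sc.textFile('twitter.txt')`, the parser could be applied to every element with `rdd.map(parse_tweet_line)`.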

Examples:

In [20]:
# read a csv file
rdd = sc.textFile('mtcars.csv')
rdd.take(5)

[',mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb',
 'Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4',
 'Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4',
 'Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1',
 'Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1']
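Each element above is still a raw string, and the first element is the header row. The following is a minimal sketch, not part of the original notebook, of dropping the header and converting the remaining lines. `parse_mtcars_line` and `parse_mtcars_rdd` are hypothetical helper names, and the plain `split(',')` assumes that no field contains a quoted comma.

```python
def parse_mtcars_line(line):
    """Hypothetical parser: split one CSV line into (model, [numeric values])."""
    fields = line.split(',')
    return fields[0], [float(x) for x in fields[1:]]


# Two lines copied from the output of rdd.take(5).
header = ',mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb'
sample = 'Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4'

columns = header.split(',')[1:]          # column names without the empty first field
model, values = parse_mtcars_line(sample)


def parse_mtcars_rdd(rdd):
    """Drop the header line, then parse every remaining line.

    `rdd` is assumed to be the result of sc.textFile('mtcars.csv').
    """
    first = rdd.first()
    return rdd.filter(lambda line: line != first).map(parse_mtcars_line)
```

With the RDD from the cell above, `parse_mtcars_rdd(rdd).take(1)` would be expected to return the parsed record for the first car.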

In [21]:
# read a txt file
rdd = sc.textFile('twitter.txt')
rdd.take(5)

['Fresh install of XP on new computer. Sweet relief! fuck vista\t1018769417\t1.0',
 'Well. Now I know where to go when I want my knives. #ChiChevySXSW http://post.ly/RvDl\t10284216536\t1.0',
 '"Literally six weeks before I can take off ""SSC Chair"" off my email. Its like the torturous 4th mile before everything stops hurting."\t10298589026\t1.0',
 'Mitsubishi i MiEV - Wikipedia, the free encyclopedia - http://goo.gl/xipe Cutest car ever!\t109017669432377344\t1.0',
 "'Cheap Eats in SLP' - http://t.co/4w8gRp7\t109642968603963392\t1.0"]