# Spark and Python



In [1]:
from pyspark import SparkContext

A SparkContext represents the connection to a Spark cluster and is used to create RDDs and broadcast variables on that cluster.



In [2]:
sc = SparkContext()

## Basic Operations

First, write a small example text file to read:

In [9]:
%%writefile example.txt
first line
second line
third line
fourth line

Overwriting example.txt


### Creating the RDD

The textFile method reads a text file from HDFS, a local file system (available on all
nodes), or any Hadoop-supported file system URI, and returns it as an RDD of strings.

In [7]:
textFile = sc.textFile('example.txt')

Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.

### Actions

We created an RDD with the textFile method and can now run actions on it, such as counting its lines or returning the first one.


In [8]:
textFile.count()

4

In [10]:
textFile.first()

'first line'

### Transformations

The filter method returns a new RDD containing only the elements that satisfy a condition. Let's look for lines that contain the word 'second'; only one line should match.

In [11]:
secfind = textFile.filter(lambda line: 'second' in line)

In [12]:
# A transformation returns a new RDD, not the matching lines
secfind

PythonRDD[7] at RDD at PythonRDD.scala:43

In [13]:
# Run an action on the transformed RDD to get the matching lines
secfind.collect()

['second line']

In [14]:
# Run another action on the transformed RDD to count the matches
secfind.count()

1

Notice that a transformation such as filter produces no output and does not run when it is defined. Spark evaluates it lazily, only when an action such as collect or count is called.