#### Step 1: Upload the data file to HDFS using the command:

In [1]:
!hdfs dfs -put  ~/training_materials/data/sparkintro.txt /user/cloudera/

This puts the file in the user's home directory in HDFS.  
-hdfs dfs is the command-line client for interacting with HDFS  
-put copies a file from the local file system to HDFS  
-source is the path of the file in the local file system  
-destination is the path of the file in HDFS

#### Step 2: Review the text file ‘sparkintro.txt’, which is used in the following steps, using the command:

In [2]:
!hdfs dfs -cat sparkintro.txt

Apache Spark is a fast and general engine for large-scale data processing.
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Write applications quickly in Java, Scala, Python, R.
Spark offers over 80 high-level operators that make it easy to build parallel apps. 
And you can use it interactively from the Scala, Python and R shells.
Combine SQL, streaming, and complex analytics.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. 
You can combine these libraries seamlessly in the same application.
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. 
Access data in HDFS, Cassandra,

-cat is used to display the contents of a file

#### Step 3: Create your first RDD from the file

In [3]:
data=sc.textFile("sparkintro.txt")

-sc.textFile("filename") creates an RDD with one element per line of the file  
At this point, Spark has not yet read the file.  
Spark does no work until an action is performed on the RDD.  
You can verify this by specifying a wrong file name here: the error appears only when you perform an action.
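
The deferred behaviour described above can be illustrated with a plain-Python analogy. This is not Spark code: it is a minimal sketch that uses a generator to stand in for the RDD, and the function name `lazy_lines` and the missing file name are hypothetical.

```python
def lazy_lines(path):
    """Yield the lines of a file; nothing is opened until iteration starts."""
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n")


# Building the generator plays the role of sc.textFile(...): no I/O happens
# yet, so the wrong file name goes unnoticed at this point.
lines = lazy_lines("no_such_file_for_lazy_demo.txt")  # hypothetical missing file

# Consuming the generator plays the role of an action: only now is the file
# opened, and only now does the error surface.
deferred_error = None
try:
    next(lines)
except IOError as exc:
    deferred_error = exc

print("Error raised only on consumption: " + str(deferred_error is not None))
```

In the same way, a mistyped path passed to sc.textFile is reported when an action such as count() runs, not when the RDD is defined.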


#### Step 4: Count the number of elements in the RDD

In [4]:
data.count()

14

-count() is an action that returns the number of elements in an RDD

#### Step 5: Split each element of the RDD on the newline ('\n') delimiter

In [6]:
data_split=data.map(lambda line: line.split('\n'))

This splits each element on the newline character as a delimiter. Because each element is already a single line, every result is a one-element list.
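
To see what the function passed to map does to a single element, it can be tried on ordinary strings outside Spark. The sketch below is plain Python, not an RDD operation: two lines copied from sparkintro.txt stand in for RDD elements, and the split on a space is shown only for comparison.

```python
# Two sample lines from sparkintro.txt, standing in for elements of the RDD.
lines = [
    "Write applications quickly in Java, Scala, Python, R.",
    "Combine SQL, streaming, and complex analytics.",
]

# Same function as in Step 5: no line contains '\n', so nothing is split
# and each line ends up alone inside a one-element list.
by_newline = [line.split("\n") for line in lines]

# Splitting on a space instead yields one list of words per line.
by_space = [line.split(" ") for line in lines]

print(by_newline[0])
print(by_space[0])
```

On the RDD itself, the space-delimited version would be data.map(lambda line: line.split(' ')), which produces one list of words per line.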

#### Step 6: Print the array of all elements 

In [7]:
data_split.collect()

[[u'Apache Spark is a fast and general engine for large-scale data processing.'],
 [u'Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.'],
 [u'Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.'],
 [u'Write applications quickly in Java, Scala, Python, R.'],
 [u'Spark offers over 80 high-level operators that make it easy to build parallel apps. '],
 [u'And you can use it interactively from the Scala, Python and R shells.'],
 [u'Combine SQL, streaming, and complex analytics.'],
 [u'Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. '],
 [u'You can combine these libraries seamlessly in the same application.'],
 [u'Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.'],
 [u'You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, 

-collect() is an action that returns all elements of the RDD to the driver as a list.  
Note: collect() is not advisable for large datasets; avoid it where possible, as it may bring down the Spark shell.

#### Step 7: Print only 3 elements of the RDD

In [8]:
data_split.take(3)

[[u'Apache Spark is a fast and general engine for large-scale data processing.'],
 [u'Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.'],
 [u'Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.']]

-take(n) is an action that returns the first n elements of the RDD

#### Step 8: Keep only the elements that contain the word "Spark", then print them

In [10]:
data_filter=data.filter(lambda words: "Spark" in words)
data_filter.collect()

[u'Apache Spark is a fast and general engine for large-scale data processing.',
 u'Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.',
 u'Spark offers over 80 high-level operators that make it easy to build parallel apps. ',
 u'Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. ',
 u'Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.',
 u'You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. ']

This will filter only those elements that contain the word "Spark"
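
The predicate passed to filter receives one whole line at a time, and "Spark" in line is a case-sensitive substring test. A small plain-Python sketch of that predicate follows. It is not Spark code, the helper name `contains_spark` is hypothetical, and the lowercase line is made up for illustration and does not appear in sparkintro.txt.

```python
def contains_spark(line):
    """Same test as the lambda in Step 8: case-sensitive substring match."""
    return "Spark" in line


lines = [
    "Apache Spark is a fast and general engine for large-scale data processing.",
    "Write applications quickly in Java, Scala, Python, R.",
    "spark in lowercase is not matched.",  # hypothetical line, not in the file
]

# Keep only the lines for which the predicate is true, as filter does.
kept = [line for line in lines if contains_spark(line)]

print(len(kept))
```

Only the first line is kept: the second does not mention Spark, and the third contains only the lowercase form.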

#### Step 9: Save the elements as a text file

In [12]:
data_filter.saveAsTextFile("spark_intro_output")

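-saveAsTextFile(...) writes the elements of an RDD as text, on the local file system or in HDFS  
The output is a directory (here spark_intro_output) containing one part-* file per partition.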

#### Step 10: You can review the output using the following command:

In [13]:
!hdfs dfs -cat spark_intro_output/part*

Apache Spark is a fast and general engine for large-scale data processing.
Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Spark offers over 80 high-level operators that make it easy to build parallel apps. 
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. 
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. 
