# Loading data in PySpark shell

- In PySpark, we express computations as operations on distributed collections that are automatically parallelized across the cluster. The previous exercise loaded a Python list as a parallelized collection; in this exercise, you'll load data from a local file in the PySpark shell.

- Remember that a `SparkContext` named `sc` and a `file_path` variable (the path to the README.md file) are already available in your workspace.

## Instructions

- Load the local text file README.md in the PySpark shell.

In [1]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6" 
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

# NOTE: List whichever packages you need here.
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'

In [2]:
# Entry point for Spark 2.x
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()

# On yarn:
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()
# specify .master("yarn")

sc = spark.sparkContext

In [8]:
file_path = 'file:///home/talentum/spark/README.md'
# Create an RDD from a local file in the PySpark shell
lines = sc.textFile(file_path)
# textFile is lazy: no data is read until an action such as count() runs
print(type(lines))

<class 'pyspark.rdd.RDD'>
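
The `file://` scheme in `file_path` tells Spark to read from the local filesystem instead of the cluster's default filesystem. As a minimal sketch, such a URI can be built from an absolute local path with the standard library. The helper name `to_file_uri` is hypothetical and not part of PySpark, and the example assumes a POSIX-style path.

```python
from pathlib import Path


def to_file_uri(local_path):
    """Return a file:// URI for an absolute local path (hypothetical helper)."""
    # Path.as_uri() raises ValueError if the path is not absolute.
    return Path(local_path).as_uri()


file_path = to_file_uri("/home/talentum/spark/README.md")
print(file_path)  # file:///home/talentum/spark/README.md
```

The resulting string can be passed to `sc.textFile(file_path)` in the same way as a hand-written URI.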


In [7]:
lines.count()

104
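
As a sanity check on a single machine, the value returned by `lines.count()` can be compared with a plain-Python line count of the same file. The sketch below is an illustration only: the helper name `count_lines` and the temporary sample file are assumptions, and for an ordinary newline-terminated text file the two counts would be expected to agree.

```python
import os
import tempfile


def count_lines(local_path):
    """Count the lines of a local text file without Spark (hypothetical helper)."""
    with open(local_path, encoding="utf-8") as f:
        return sum(1 for _ in f)


# Write a tiny sample file so the example runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as tmp:
    tmp.write("# Apache Spark\n\nSpark is a fast and general cluster computing system\n")
    sample_path = tmp.name

sample_count = count_lines(sample_path)
os.remove(sample_path)
print(sample_count)  # 3
```

Note that the blank line is counted too, just as the empty string `''` appears as its own element in the RDD.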

In [9]:
print(lines.take(5))

['# Apache Spark', '', 'Spark is a fast and general cluster computing system for Big Data. It provides', 'high-level APIs in Scala, Java, Python, and R, and an optimized engine that', 'supports general computation graphs for data analysis. It also supports a']


In [13]:
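# take(n) with n = count() returns every line; lines.collect() gives the same list with a single job
print(lines.take(lines.count()))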

['# Apache Spark', '', 'Spark is a fast and general cluster computing system for Big Data. It provides', 'high-level APIs in Scala, Java, Python, and R, and an optimized engine that', 'supports general computation graphs for data analysis. It also supports a', 'rich set of higher-level tools including Spark SQL for SQL and DataFrames,', 'MLlib for machine learning, GraphX for graph processing,', 'and Spark Streaming for stream processing.', '', '<http://spark.apache.org/>', '', '', '## Online Documentation', '', 'You can find the latest Spark documentation, including a programming', 'guide, on the [project web page](http://spark.apache.org/documentation.html).', 'This README file only contains basic setup instructions.', '', '## Building Spark', '', 'Spark is built using [Apache Maven](http://maven.apache.org/).', 'To build Spark and its example programs, run:', '', '    build/mvn -DskipTests clean package', '', '(You do not need to do this if you downloaded a pre-built package.)', '',

In [14]:
# Convert RDD to a list and join the lines with a delimiter
lines_list = lines.collect()
lines_joined = '.'.join(lines_list)

# Print the joined lines
print(lines_joined)


# Apache Spark..Spark is a fast and general cluster computing system for Big Data. It provides.high-level APIs in Scala, Java, Python, and R, and an optimized engine that.supports general computation graphs for data analysis. It also supports a.rich set of higher-level tools including Spark SQL for SQL and DataFrames,.MLlib for machine learning, GraphX for graph processing,.and Spark Streaming for stream processing...<http://spark.apache.org/>...## Online Documentation..You can find the latest Spark documentation, including a programming.guide, on the [project web page](http://spark.apache.org/documentation.html)..This README file only contains basic setup instructions...## Building Spark..Spark is built using [Apache Maven](http://maven.apache.org/)..To build Spark and its example programs, run:..    build/mvn -DskipTests clean package..(You do not need to do this if you downloaded a pre-built package.)..More detailed documentation is available from the project site, at.["Building Spa
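
The doubled dots in the output come from the empty strings in the list: `str.join` places the delimiter between every pair of elements, so a blank line contributes two delimiters in a row. A small sketch using the first three lines from the README, with no Spark needed, shows the effect and how joining with `'\n'` instead would restore the original line layout. The variable names are illustrative.

```python
# First three elements of the list returned by lines.collect()
first_lines = [
    '# Apache Spark',
    '',
    'Spark is a fast and general cluster computing system for Big Data. It provides',
]

# '.' is inserted between elements, so the empty string yields '..'
joined_with_dot = '.'.join(first_lines)
print(joined_with_dot)

# Joining with a newline reproduces the text as it appears in the file
joined_with_newline = '\n'.join(first_lines)
print(joined_with_newline)
```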