The following are Spark’s core data sources:
- CSV
- JSON
- Parquet
- ORC
- JDBC/ODBC connections
- Plain-text files
As mentioned, Spark has numerous community-created data sources. Here’s just a small sample:
- Cassandra
- HBase
- MongoDB
- AWS Redshift
- XML, etc.

DataFrameReader.format(...).option("key", "value").schema(...).load()

In [0]:
'''spark.read.format("csv")
.option("mode", "FAILFAST")
.option("inferSchema", "true")
.option("path", "path/to/file(s)")
.schema(someSchema)
.load()'''

Read modes, set with .option("mode", ...):
- permissive: Sets all fields to null when it encounters a corrupted record and places all corrupted records in a string column called _corrupt_record. This is the default.
- dropMalformed: Drops the row that contains malformed records
- failFast: Fails immediately upon encountering malformed records

DataFrameWriter.format(...).option(...).partitionBy(...).bucketBy(...).sortBy(...).save()

format is optional because, by default, Spark uses the Parquet format. option, again, lets us
configure how to write out our given data. partitionBy, bucketBy, and sortBy work only for
file-based data sources; you can use them to control the specific layout of files at the destination.

dataframe.write.format("csv")\
.mode("overwrite")\
.option("dateFormat", "yyyy-MM-dd")\
.option("path", "path/to/file(s)")\
.save()

Save modes:
- append: Appends the output files to the list of files that already exist at that location
- overwrite: Completely overwrites any data that already exists there
- errorIfExists: Throws an error and fails the write if data or files already exist at the specified location
- ignore: If data or files exist at the location, does nothing with the current DataFrame

The default is errorIfExists. This means that if Spark finds data at the location to which
you’re writing, it will fail the write immediately.

Writing a CSV file as a TSV file

csvFileDF.write.format("csv").mode("overwrite").option("sep", "\t")\
.save("/tmp/my-tsv-file.tsv")

When you list the destination directory, you can see that my-tsv-file.tsv is actually a folder with
numerous files in it. The number of files reflects the number of partitions in our DataFrame at the time we write it out.

In Spark, when we refer to JSON files, we mean line-delimited JSON files. This contrasts with
files that contain one large JSON object or array per file.
The line-delimited versus multiline trade-off is controlled by a single option: **multiLine**. When
you set this option to true, you can read an entire file as one JSON object. Line-delimited JSON is
the more stable format because it lets you append a new record to a file (rather than having to
read in the entire file and then write it out again).

In [0]:
df_json=spark.read.format("json").option("mode", "FAILFAST")\
.option("inferSchema", "true")\
.load("dbfs:/FileStore/flight_data/2010_summary.json")

In [0]:
df_json.write.format("json").mode("overwrite").save("dbfs:/FileStore/tmp/my-json-file.json")

Writing JSON files is just as simple as reading them and, as you might expect, the data source
does not matter. We could therefore just as well reuse the CSV DataFrame as the
source for our JSON file.

##Parquet Files
Parquet is an open source column-oriented data store that provides a variety of storage
optimizations, especially for analytics workloads. It provides columnar compression, which
saves storage space and allows for reading individual columns instead of entire files.

We recommend writing data out to Parquet for long-term storage because reading from a Parquet file
is typically more efficient than reading JSON or CSV. Another advantage of Parquet is that it
supports complex types. This means that if your column is an array (which would fail with a
CSV file, for example), map, or struct, you’ll still be able to read and write that file without
issue.

Parquet has very few options because it enforces its own schema when storing data. Thus, all you
need to set is the format and you are good to go.

*Note: Even though there are only two options, you can still encounter problems if you’re working with
incompatible Parquet files. Be careful when you write out Parquet files with different versions of
Spark (especially older ones) because this can cause significant headaches.*

**Read Format:**

spark.read.format("parquet").load("/data/flight-data/parquet/2010-summary.parquet").show(5)

**Write Format**

csvFile.write.format("parquet").mode("overwrite").save("/tmp/my-parquet-file.parquet")

##ORC Files
ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is
optimized for large streaming reads, but with integrated support for finding required rows
quickly.

For the most part, ORC and Parquet are quite similar; the fundamental difference is that Parquet
is further optimized for use with Spark, whereas ORC is further optimized for Hive.

spark.read.format("orc").load("/data/flight-data/orc/2010-summary.orc").show(5)

csvFile.write.format("orc").mode("overwrite").save("/tmp/my-orc-file.orc")

##SQL Databases

To read from and write to these databases, you need to do two things: include the Java Database
Connectivity (JDBC) driver for your particular database on the Spark classpath, and provide the
proper JAR for the driver itself.

For example, to be able to read and write from PostgreSQL, you
might run something like this:

./bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar

Read Format

driver = "org.sqlite.JDBC"
path = "/data/flight-data/jdbc/my-sqlite.db"
url = "jdbc:sqlite:" + path
tablename = "flight_info"

dbDataFrame = spark.read.format("jdbc").option("url", url).option("dbtable", tablename).option("driver", driver).load()

pgDF = spark.read.format("jdbc")\
.option("driver", "org.postgresql.Driver")\
.option("url", "jdbc:postgresql://database_server")\
.option("dbtable", "schema.tablename")\
.option("user", "username").option("password", "my-secret-password").load()

Write Format

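# Connection properties for the JDBC writer; here just the SQLite driver class used above
props = {"driver": "org.sqlite.JDBC"}
newPath = "jdbc:sqlite://tmp/my-sqlite.db"
csvFile.write.jdbc(newPath, tablename, mode="overwrite", properties=props)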


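##Text Files
Spark can also read plain-text files. Each line in the file becomes a record in the DataFrame,
and it is then up to you to transform it. For example, suppose that you need to parse some
Apache log files into a more structured format, or perhaps you want to parse some plain text for
natural-language processing.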

Read Format

spark.read.text("/data/flight-data/csv/2010-summary.csv")\
.selectExpr("split(value, ',') as rows").show()

Writing Text Files

When you write a text file, you need to be sure to have only one string column; otherwise, the
write will fail:

csvFile.select("DEST_COUNTRY_NAME").write.text("/tmp/simple-text-file.txt")

If you partition the data as part of the write, you can write more columns. However, those columns will manifest as
directories in the folder to which you’re writing, instead of columns in every single file.

csvFile.limit(10).select("DEST_COUNTRY_NAME", "count")\
.write.partitionBy("count").text("/tmp/five-csv-files2py.csv")

##Advanced I/O Concepts
We saw previously that we can control the parallelism of files that we write by controlling the
partitions prior to writing. We can also control specific data layout by controlling two things:
bucketing and partitioning.

Certain file formats are fundamentally “splittable.” This can improve speed because it makes it
possible for Spark to avoid reading an entire file, and access only the parts of the file necessary
to satisfy your query.
In conjunction with this is a need to manage compression. Not all compression schemes
are splittable. How you store your data is of immense consequence when it comes to making
your Spark jobs run smoothly. We recommend Parquet with gzip compression.