# Data Source API

Spark provides a flexible API for reading from and writing to many formats and sources.

The examples below cover only a few of them. Spark supports many more formats, and you can implement new data sources if needed.

## CSV

### Read

In [19]:
df = spark.read.format("csv") \
  .option("header", "true") \
  .option("mode", "FAILFAST") \
  .load("/work/data/flight-data/csv/2010-summary.csv")
df.show(3)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
+-----------------+-------------------+-----+
only showing top 3 rows



### Write

In [20]:
df.write.format("csv") \
    .mode("overwrite") \
    .option("sep", "\t") \
    .save("/tmp/my-tsv-file.tsv")
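Reading the file back requires the same separator that was used for writing. The following is a minimal sketch, assuming an active `spark` session like the one used in the cells above; `read_tsv` is a hypothetical helper name, not part of the original notebook. Because the write above does not set the `header` option, the output has no header row, so Spark names the columns `_c0`, `_c1`, `_c2` and reads every value as a string.

```python
def read_tsv(spark, path="/tmp/my-tsv-file.tsv"):
    """Load the tab-separated output written above into a DataFrame.

    `spark` is assumed to be an existing SparkSession.
    """
    return (
        spark.read.format("csv")
        .option("sep", "\t")         # must match the separator used when writing
        .option("header", "false")   # the write above did not emit a header row
        .load(path)
    )
```

In the notebook this would be called as `read_tsv(spark).show(3)`.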

## JSON

### Read

In [22]:
# The JSON reader always infers the schema, so no inferSchema option is needed
df = spark.read.format("json") \
  .option("mode", "FAILFAST") \
  .load("/work/data/flight-data/json/2010-summary.json")
df.show(3)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
+-----------------+-------------------+-----+
only showing top 3 rows



### Write

In [23]:
df.write.format("json").mode("overwrite").save("/tmp/my-json-file.json")

## Parquet

### Read

In [24]:
# Parquet is Spark's default format, so format("parquet") is optional
df = spark.read.load("/work/data/flight-data/parquet/2010-summary.parquet")
df.show(5)


+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



### Write

In [25]:
df.write.mode("overwrite").save("/tmp/my-parquet-file.parquet")

## SQL

For this example, edit the file master/jupyter.sh on your host machine and add the following options to the pyspark command:

```--driver-class-path /work/lib/sqlite-jdbc-3.39.3.0.jar --jars /work/lib/sqlite-jdbc-3.39.3.0.jar```

These options tell Spark where to find the SQLite JDBC driver library.
Once the changes have been applied, restart Jupyter Notebook.

Set data source options:

In [2]:
path = "/work/data/flight-data/jdbc/my-sqlite.db"
url = "jdbc:sqlite:" + path
driver = "org.sqlite.JDBC"
tablename = "flight_info"

Connect to the data source:

In [5]:
db_dataframe = spark.read.format("jdbc").\
        option("url", url).\
        option("dbtable", tablename).\
        option("driver",  driver).\
        load()
db_dataframe.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



Query the DataFrame:

In [13]:
qry_db_dataframe = db_dataframe.select('DEST_COUNTRY_NAME').distinct()
qry_db_dataframe.show(3)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|         Anguilla|
|           Russia|
|         Paraguay|
+-----------------+
only showing top 3 rows



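With JDBC data sources, Spark tries to push filters down to the database instead of applying them on the Spark side. In the plan for the query above, only the column selection reaches the database: `PushedFilters` is empty, and the `distinct` is computed by Spark as a `HashAggregate`: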

In [14]:
qry_db_dataframe.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#25], functions=[])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#25, 200), true, [id=#125]
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#25], functions=[])
      +- *(1) Scan JDBCRelation(flight_info) [numPartitions=1] [DEST_COUNTRY_NAME#25] PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>
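A plain filter, unlike the `distinct` in the plan above, is the kind of operation Spark can hand over to the database. The following is a minimal sketch for checking this. The helper name `explain_destination_filter` and the example value `"Egypt"` are illustrative and not part of the original notebook. Whether a particular filter is actually pushed down depends on the Spark version and the JDBC dialect, so inspect the `PushedFilters` entry of the printed plan rather than assuming it.

```python
def explain_destination_filter(df, country):
    """Filter a DataFrame by destination country and print its physical plan.

    `df` is assumed to be a DataFrame loaded through the JDBC source,
    such as `db_dataframe` above.
    """
    # Build the condition from the DataFrame's own column object
    filtered = df.filter(df["DEST_COUNTRY_NAME"] == country)
    # If the filter was pushed down, it appears under PushedFilters in the scan node
    filtered.explain()
    return filtered
```

In the notebook this would be called as `explain_destination_filter(db_dataframe, "Egypt")`.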




To have the database run the whole query, including the `DISTINCT`, pass an explicit query instead of a table name at load time:

In [15]:
pushdown_qry = """(SELECT DISTINCT(DEST_COUNTRY_NAME) FROM flight_info) AS flight_info"""
db_dataframe = spark.read.format("jdbc").\
        option("url", url).\
        option("dbtable", pushdown_qry).\
        option("driver",  driver).\
        load()
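Spark uses the `dbtable` value in the `FROM` clause of the statement it sends to the database, which is why a parenthesized subquery with an alias is accepted there. The sketch below uses Python's built-in `sqlite3` module to show what the database itself returns for such a subquery. It is an illustration under assumptions: the in-memory table is a hypothetical stand-in for `my-sqlite.db`, filled only with the five rows displayed earlier, and the outer `SELECT *` approximates the statement Spark generates rather than reproducing it exactly.

```python
import sqlite3

# Hypothetical in-memory stand-in for my-sqlite.db, holding the five rows shown above
rows = [
    ("United States", "Romania", 1),
    ("United States", "Ireland", 264),
    ("United States", "India", 69),
    ("Egypt", "United States", 24),
    ("Equatorial Guinea", "United States", 1),
]
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE flight_info '
    '(DEST_COUNTRY_NAME TEXT, ORIGIN_COUNTRY_NAME TEXT, "count" INTEGER)'
)
conn.executemany("INSERT INTO flight_info VALUES (?, ?, ?)", rows)

# Same subquery as the one passed to the dbtable option
pushdown_qry = "(SELECT DISTINCT(DEST_COUNTRY_NAME) FROM flight_info) AS flight_info"

# Assumption: the subquery takes the place of a table name in the FROM clause
destinations = sorted(row[0] for row in conn.execute("SELECT * FROM " + pushdown_qry))
conn.close()

print(destinations)
```

The deduplication happens inside SQLite, so only the distinct destination names leave the database.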