## Importing data in Spark/Databricks

This notebook gives a brief overview of connecting to data in Databricks and shows the example datasets that Databricks provides. The code was adapted from [Databricks Datasets](https://docs.databricks.com/data/databricks-datasets.html), [Databricks data](https://docs.databricks.com/data/data.html), [Databricks File System](https://docs.databricks.com/data/databricks-file-system.html), and [Databricks filestore](https://docs.databricks.com/data/filestore.html).

### Seeing files on the filesystem

The [dbutils package](https://docs.databricks.com/dev-tools/databricks-utils.html) provides utilities for interacting with the file system. The `display()` function is also used heavily in this example; you can read more about the [display function](https://docs.databricks.com/notebooks/visualizations/index.html).

The next example helps us see what default data comes with our Databricks instance.

In [0]:
display(dbutils.fs.ls("/databricks-datasets"))


path,name,size,modificationTime
dbfs:/databricks-datasets/COVID/,COVID/,0,1652906982266
dbfs:/databricks-datasets/README.md,README.md,976,1532468253000
dbfs:/databricks-datasets/Rdatasets/,Rdatasets/,0,1652906982266
dbfs:/databricks-datasets/SPARK_README.md,SPARK_README.md,3359,1455043490000
dbfs:/databricks-datasets/adult/,adult/,0,1652906982266
dbfs:/databricks-datasets/airlines/,airlines/,0,1652906982266
dbfs:/databricks-datasets/amazon/,amazon/,0,1652906982266
dbfs:/databricks-datasets/asa/,asa/,0,1652906982266
dbfs:/databricks-datasets/atlas_higgs/,atlas_higgs/,0,1652906982266
dbfs:/databricks-datasets/bikeSharing/,bikeSharing/,0,1652906982266
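
The `modificationTime` column in the listing is a Unix timestamp in milliseconds. Below is a minimal sketch of turning it into a readable UTC datetime in plain Python, using two values copied from the output above. The helper name `ms_to_utc` is our own and is not part of `dbutils`.

```python
from datetime import datetime, timezone


def ms_to_utc(ms: int) -> datetime:
    """Convert a millisecond Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Values copied from the listing above.
for name, ms in [("COVID/", 1652906982266), ("README.md", 1532468253000)]:
    print(name, ms_to_utc(ms).isoformat())
```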


The next cell is a handy snippet for printing the README files that ship with the data provided by Databricks.

In [0]:
with open("/dbfs/databricks-datasets/README.md") as f:
    x = f.read()

print(x)

In [0]:
with open("/dbfs/databricks-datasets/songs/README.md") as f:
    x = f.read()

print(x)
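
The cells above read from `/dbfs/databricks-datasets/...`, while `dbutils.fs.ls()` reports the same files as `dbfs:/databricks-datasets/...`. Python's built-in `open()` needs the local-file form, where DBFS appears under `/dbfs`. Here is a small sketch of converting one form to the other; the helper name `dbfs_to_local` is hypothetical.

```python
def dbfs_to_local(path: str) -> str:
    """Map a DBFS path such as 'dbfs:/a/b' or '/a/b' to its '/dbfs/a/b' local form."""
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    if path.startswith("/dbfs/"):
        return path  # already in local form
    return "/dbfs/" + path.lstrip("/")


print(dbfs_to_local("dbfs:/databricks-datasets/README.md"))
```

On a cluster, the result can be passed to `open()` in the same way as the hard-coded paths in the cells above.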

The Rdatasets collection of R example datasets is included as well.

In [0]:
with open("/dbfs/databricks-datasets/Rdatasets/README.md") as f:
    x = f.read()

print(x)

In [0]:
display(dbutils.fs.ls("/databricks-datasets/Rdatasets/data-001/csv"))

path,name,size,modificationTime
dbfs:/databricks-datasets/Rdatasets/data-001/csv/COUNT/,COUNT/,0,1652906984865
dbfs:/databricks-datasets/Rdatasets/data-001/csv/Ecdat/,Ecdat/,0,1652906984865
dbfs:/databricks-datasets/Rdatasets/data-001/csv/HSAUR/,HSAUR/,0,1652906984865
dbfs:/databricks-datasets/Rdatasets/data-001/csv/HistData/,HistData/,0,1652906984865
dbfs:/databricks-datasets/Rdatasets/data-001/csv/KMsurv/,KMsurv/,0,1652906984865
dbfs:/databricks-datasets/Rdatasets/data-001/csv/MASS/,MASS/,0,1652906984866
dbfs:/databricks-datasets/Rdatasets/data-001/csv/Zelig/,Zelig/,0,1652906984866
dbfs:/databricks-datasets/Rdatasets/data-001/csv/boot/,boot/,0,1652906984866
dbfs:/databricks-datasets/Rdatasets/data-001/csv/car/,car/,0,1652906984866
dbfs:/databricks-datasets/Rdatasets/data-001/csv/cluster/,cluster/,0,1652906984866


In [0]:
display(dbutils.fs.ls("/databricks-datasets/Rdatasets/data-001/csv/ggplot2"))

path,name,size,modificationTime
dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv,diamonds.csv,3192560,1416619980000
dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/economics.csv,economics.csv,20731,1416619980000
dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/midwest.csv,midwest.csv,100539,1416619980000
dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/movies.csv,movies.csv,6000709,1416619980000
dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/mpg.csv,mpg.csv,17345,1416619980000
dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/msleep.csv,msleep.csv,7182,1416619980000
dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/presidential.csv,presidential.csv,512,1416619981000
dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/seals.csv,seals.csv,64016,1416619981000


### Getting data into R

Loading data into R uses the `read.df()` function from the __SparkR__ package. Notice that visualization from within R uses R graphics.

In [0]:
%r
library(sparklyr)
library(SparkR)
mpg_df <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/mpg.csv", "csv", header = "true", inferSchema = "true", na.strings = "NA")
display(mpg_df)

In [0]:
%r
display(mpg_df)

The `mpg_df` object is a Spark DataFrame. If it is small enough to fit in memory, we can use `collect()` to create an in-memory R data frame.

In [0]:
%r
mpg <- SparkR::collect(mpg_df)
display(mpg)

### Loading data into Python

In `pyspark`, `spark.read` returns a reader object. We then call the reader method that matches the file type, here `.csv()`, to import the data.

In [0]:
mpg_df = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/mpg.csv", header = "true", inferSchema = "true")
display(mpg_df)

_c0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
6,audi,a4,2.8,1999,6,manual(m5),f,18,26,p,compact
7,audi,a4,3.1,2008,6,auto(av),f,18,27,p,compact
8,audi,a4 quattro,1.8,1999,4,manual(m5),4,18,26,p,compact
9,audi,a4 quattro,1.8,1999,4,auto(l5),4,16,25,p,compact
10,audi,a4 quattro,2.0,2008,4,manual(m6),4,20,28,p,compact
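
`inferSchema` asks Spark to make an extra pass over the file to guess column types. When the columns are known in advance, a schema can be passed instead. The following is a sketch under assumptions: the column types are read off the sample rows above, the first column is renamed from `_c0` to `row_id`, and `load_mpg`, `MPG_PATH`, and `MPG_SCHEMA` are names we made up. It expects the `spark` session that Databricks notebooks provide.

```python
MPG_PATH = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/mpg.csv"

# DDL-formatted schema; the types are assumptions based on the rows shown above.
MPG_SCHEMA = (
    "row_id INT, manufacturer STRING, model STRING, displ DOUBLE, "
    "year INT, cyl INT, trans STRING, drv STRING, cty INT, hwy INT, "
    "fl STRING, `class` STRING"
)


def load_mpg(spark, path=MPG_PATH):
    """Read mpg.csv with an explicit schema instead of inferSchema."""
    # header=True skips the first line; column names come from MPG_SCHEMA.
    return spark.read.csv(path, header=True, schema=MPG_SCHEMA)
```

In a notebook cell this would be used as `display(load_mpg(spark))`.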


Notice the `.toPandas()` method, which pulls the entire distributed Spark object into local memory.

In [0]:
mpg = mpg_df.toPandas()
display(mpg)
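
Because `.toPandas()` copies every row to the driver, it can exhaust memory on a large table. One cautious pattern is to cap the row count with `.limit()` first. This is a minimal sketch; `to_pandas_capped` and the default of 10,000 rows are arbitrary choices of ours, not a Databricks recommendation.

```python
def to_pandas_capped(df, max_rows=10_000):
    """Collect at most max_rows rows of a Spark DataFrame into pandas."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() is lazy; only the capped result is collected by toPandas().
    return df.limit(max_rows).toPandas()
```

For the small `mpg_df` above, `to_pandas_capped(mpg_df)` would return the same rows as `mpg_df.toPandas()`.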

## Uploading data and using within Databricks

With `dbutils` we can use the file system methods under `.fs`, such as `.ls()`, to list all files and folders. Databricks creates a default `FileStore` folder where a user can upload files. In the `data` navigation on the left you will see a `DBFS` button that lets you upload files from your browser.

In [0]:
display(dbutils.fs.ls("/"))


In [0]:
display(dbutils.fs.ls("/FileStore"))
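
`dbutils.fs.ls()` lists a single level. To find every uploaded file under `/FileStore`, the listing can be walked recursively. The sketch below takes the listing function as an argument. It assumes, as in the outputs earlier in this notebook, that each entry has `path` and `name` fields and that directory names end with `/`. The name `walk_files` is ours.

```python
def walk_files(ls, root):
    """Yield the path of every file below root, depth first.

    ls is a function like dbutils.fs.ls that returns entries
    with .path and .name attributes.
    """
    for entry in ls(root):
        if entry.name.endswith("/"):  # directories are listed with a trailing slash
            yield from walk_files(ls, entry.path)
        else:
            yield entry.path
```

In a notebook this could be called as `for p in walk_files(dbutils.fs.ls, "/FileStore"): print(p)`.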


## Spark Databases and SQL

Databricks provides functionality to store your data objects in a Spark database. With our data objects stored in the database, we can use `%sql` within a code chunk. We may want to manipulate data with SQL in a `%sql` chunk and then access the result in a `%python` chunk.

In [0]:
mpg_df = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/mpg.csv", header = "true", inferSchema = "true")
mpg_df.createOrReplaceTempView("mpg")
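
A temporary view exists only in the current Spark session. Here is a quick sketch for checking which temporary views are registered, using `spark.catalog.listTables()`. The helper name `temp_view_names` is hypothetical.

```python
def temp_view_names(spark):
    """Return the sorted names of temporary views visible in the current session."""
    return sorted(t.name for t in spark.catalog.listTables() if t.isTemporary)
```

After the cell above has run, `temp_view_names(spark)` should include `mpg`.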


In [0]:
%sql
show databases

In [0]:
%sql
DROP VIEW IF EXISTS mpgsummary;
show tables in default

In [0]:
%sql
SELECT * FROM mpg

In [0]:
%sql
CREATE TEMPORARY VIEW mpgsummary
  AS
    SELECT manufacturer, COUNT(DISTINCT model) AS model, COUNT(DISTINCT year) AS year, MAX(year) AS max_year, MIN(year) AS min_year
    FROM mpg
    GROUP BY manufacturer

In [0]:
%sql
SELECT *
FROM mpgsummary

Now that the summary is registered as a temporary view, we can bring it back into R or Python and use those languages for further analysis.

### Into Python

In [0]:
mpg_summary = spark.sql('select * from mpgsummary').toPandas()
display(mpg_summary)

### Into R

In [0]:
%r
mpg_summary <- SparkR::sql("SELECT * FROM mpgsummary")
display(SparkR::collect(mpg_summary))