# Use the Display function

There are different ways to view data in a DataFrame. This notebook covers these methods as well as transformations to further refine the data.

**Technical Accomplishments:**
* Introduce the transformations...
  * `limit(..)`
  * `select(..)`
  * `drop(..)`
  * `distinct()`
  * `dropDuplicates(..)`
* Introduce the actions...
  * `show(..)`
  * `display(..)`

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [0]:
%run "./Includes/Classroom-Setup"

Prepare the data source.

In [0]:
(source, sasEntity, sasToken) = getAzureDataSource()

spark.conf.set(sasEntity, sasToken)

In [0]:
path = source + "/wikipedia/pagecounts/staging_parquet_en_only_clean/"
files = dbutils.fs.ls(path)
display(files)

path,name,size,modificationTime
wasbs://training@dbtrainwesteurope.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/_SUCCESS,_SUCCESS,0,1545489046000
wasbs://training@dbtrainwesteurope.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/_committed_6241970109963426653,_committed_6241970109963426653,760,1545489046000
wasbs://training@dbtrainwesteurope.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/_started_6241970109963426653,_started_6241970109963426653,0,1545489046000
wasbs://training@dbtrainwesteurope.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00000-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00000-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2996913,1545489048000
wasbs://training@dbtrainwesteurope.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00001-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00001-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2994285,1545489048000
wasbs://training@dbtrainwesteurope.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00002-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00002-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2994196,1545489048000
wasbs://training@dbtrainwesteurope.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00003-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00003-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2992431,1545489049000
wasbs://training@dbtrainwesteurope.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00004-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00004-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2990093,1545489049000
wasbs://training@dbtrainwesteurope.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00005-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00005-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2989931,1545489050000
wasbs://training@dbtrainwesteurope.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00006-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00006-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2989314,1545489050000


As we can see from the files listed above, this data is stored in <a href="https://parquet.apache.org" target="_blank">Parquet</a> files which can be read in a single command, the result of which will be a `DataFrame`.

Create the DataFrame. This is the same one we created in the previous two notebooks.

In [0]:
parquetDir = source + "/wikipedia/pagecounts/staging_parquet_en_only_clean/"

In [0]:
pagecountsEnAllDF = (spark  # Our SparkSession & Entry Point
  .read                     # Our DataFrameReader
  .parquet(parquetDir)      # Returns an instance of DataFrame
)
print(pagecountsEnAllDF)    # Python hack to see the data type

DataFrame[project: string, article: string, requests: int, bytes_served: bigint]


##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) show(..)

What we want to look for next is a function that will allow us to print the data to the console.

In the API docs for `DataFrame`/`Dataset` find the docs for the `show(..)` command(s).

In the case of Python, we have one method with two optional parameters.<br/>
In the case of Scala, we have several overloaded methods.<br/>

In either case, the `show(..)` method effectively has two optional parameters:
* **n**: The number of records to print to the console, the default being 20.
* **truncate**: If true, values longer than 20 characters are truncated when printed. The default is true.
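
The truncation rule can be modelled in a few lines of plain Python. This is only a sketch of the behavior as we understand it, not Spark's own code: `truncate_cell` and `long_title` are hypothetical names, and the sketch assumes that `truncate=True` acts like a width of 20 and `truncate=False` like a width of 0.

```python
def truncate_cell(value, truncate=20):
    """Plain-Python model of how show(..) shortens one cell (not Spark's code)."""
    text = str(value)
    if truncate <= 0 or len(text) <= truncate:
        return text                      # short enough, or truncation disabled
    if truncate < 4:
        return text[:truncate]           # too narrow to fit an ellipsis
    return text[:truncate - 3] + "..."   # cut and mark the cut with "..."


long_title = "A_very_long_article_title_indeed"    # made-up 32-character value
shortened = truncate_cell(long_title)               # default width of 20
untouched = truncate_cell(long_title, truncate=0)   # truncation disabled
```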

Let's take a look at the data in our `DataFrame` with the `show()` command:

In [0]:
pagecountsEnAllDF.show(10)

# If truncate is True, displayed columns are cut off when they exceed a certain width. The default is True.
# pagecountsEnAllDF.show(5, truncate=False) 

+-------+----------------+--------+------------+
|project|         article|requests|bytes_served|
+-------+----------------+--------+------------+
|     en|  !?Revolution!?|       1|           0|
|     en|   !Ay,_caramba!|       1|           0|
|     en|        !DOCTYPE|       1|           0|
|     en|!Gã!nge_language|       1|           0|
|     en| !Hukwe_language|       1|           0|
|     en|       !Kung_San|       1|           0|
|     en|!O!kung_language|       1|           0|
|     en|   !Ora_language|       1|           0|
|     en|       !T.O.O.H!|       1|           0|
|     en|           !Tre!|       1|           0|
+-------+----------------+--------+------------+
only showing top 10 rows



In [0]:
pagecountsEnAllDF.columns

Out[35]: ['project', 'article', 'requests', 'bytes_served']

In the `show(..)` cell above, change the parameters of the show command to:
* print only the first five records
* disable truncation
* print only the first ten records and disable truncation

**Note:** The function `show(..)` is an **action** which triggers a job.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) display(..)

The `show(..)` command is part of the core Spark API and simply prints the results to the console.

Our notebooks have a slightly more elegant alternative.

Instead of calling `show(..)` on an existing `DataFrame` we can instead pass our `DataFrame` to the `display(..)` command:

In [0]:
display(pagecountsEnAllDF)

project,article,requests,bytes_served
en,!?Revolution!?,1,0
en,"!Ay,_caramba!",1,0
en,!DOCTYPE,1,0
en,!Gã!nge_language,1,0
en,!Hukwe_language,1,0
en,!Kung_San,1,0
en,!O!kung_language,1,0
en,!Ora_language,1,0
en,!T.O.O.H!,1,0
en,!Tre!,1,0


### show(..) vs display(..)
* `show(..)` is part of core Spark - `display(..)` is specific to our notebooks.
* `show(..)` is ugly - `display(..)` is pretty.
* `show(..)` has parameters for truncating both columns and rows - `display(..)` does not.
* `show(..)` is a function of the `DataFrame`/`Dataset` class - `display(..)` works with a number of different objects.
* `display(..)` is more powerful - with it, you can...
  * Download the results as CSV
  * Render line charts, bar charts & other graphs, maps and more.
  * See up to 1000 records at a time.
  
For the most part, the difference between the two is going to come down to preference.

Like `DataFrame.show(..)`, `display(..)` is an **action** which triggers a job.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) limit(..)

Both `show(..)` and `display(..)` are **actions** that trigger jobs (though in slightly different ways).

If you recall, `show(..)` has a parameter to control how many records are printed, but `display(..)` does not.

We can address that difference with our first transformation, `limit(..)`.

If you look at the API docs, `limit(..)` is described like this:
> Returns a new Dataset by taking the first n rows...

`show(..)`, like many actions, does not return anything. 

On the other hand, transformations like `limit(..)` return a **new** `DataFrame`:

In [0]:
limitedDF = pagecountsEnAllDF.limit(5) # "limit" the number of records to the first 5

limitedDF # Python hack to force printing of the data type

Out[11]: DataFrame[project: string, article: string, requests: int, bytes_served: bigint]

### Nothing Happened
* Notice how "nothing" happened - that is, no job was triggered.
* This is because we are simply defining the second step in our transformations.
  0. Read in the parquet file (represented by **pagecountsEnAllDF**).
  0. Limit those records to just the first 5 (represented by **limitedDF**).
* It's not until we induce an action that a job is triggered and the data is processed.
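
The split between defining steps and running them can be mimicked in plain Python. The sketch below is a toy model under our own assumptions, not how Spark is implemented: the `LazyRows` class, its `collect()` action and the `read_source` function are all hypothetical names, and the three rows are made up.

```python
class LazyRows:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # callable that produces the raw rows
        self._steps = tuple(steps)   # transformations recorded so far

    def limit(self, n):
        # Transformation: returns a NEW object and reads nothing
        def take_first_n(rows):
            return rows[:n]
        return LazyRows(self._source, self._steps + (take_first_n,))

    def collect(self):
        # Action: only now is the source read and every recorded step applied
        rows = list(self._source())
        for step in self._steps:
            rows = step(rows)
        return rows


reads = []  # one entry is appended each time the source is actually read

def read_source():
    reads.append("read")
    return [("en", "!DOCTYPE", 1), ("en", "!Kung_San", 1), ("en", "!Tre!", 1)]

all_rows = LazyRows(read_source)   # step 1: describe the read
limited = all_rows.limit(2)        # step 2: describe the limit
jobs_before_action = len(reads)    # still nothing has been read
result = limited.collect()         # the "job" runs here
```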

We can induce a job by calling either the `show(..)` or the `display(..)` actions:

In [0]:
limitedDF.show(100, False) # Show up to 100 records and don't truncate the columns

+-------+----------------+--------+------------+
|project|article         |requests|bytes_served|
+-------+----------------+--------+------------+
|en     |!?Revolution!?  |1       |0           |
|en     |!Ay,_caramba!   |1       |0           |
|en     |!DOCTYPE        |1       |0           |
|en     |!Gã!nge_language|1       |0           |
|en     |!Hukwe_language |1       |0           |
+-------+----------------+--------+------------+



In [0]:
display(limitedDF) # defaults to the first 1000 records

project,article,requests,bytes_served
en,!?Revolution!?,1,0
en,"!Ay,_caramba!",1,0
en,!DOCTYPE,1,0
en,!Gã!nge_language,1,0
en,!Hukwe_language,1,0


##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) select(..)

Let's say, for the sake of argument, that we don't want to look at all the data:

In [0]:
pagecountsEnAllDF.printSchema()

root
 |-- project: string (nullable = true)
 |-- article: string (nullable = true)
 |-- requests: integer (nullable = true)
 |-- bytes_served: long (nullable = true)



For example, **bytes_served** shows nothing but zeros in the output above and consequently is of no value to us.

If that is the case, we can disregard it by selecting only the three columns that we want:

In [0]:
# Transform the data by selecting only three columns
onlyThreeDF = (pagecountsEnAllDF
  .select("project", "article", "requests") # Our 2nd transformation (4 >> 3 columns)
)
# Now let's take a look at what the schema looks like
onlyThreeDF.printSchema()

root
 |-- project: string (nullable = true)
 |-- article: string (nullable = true)
 |-- requests: integer (nullable = true)



Again, notice how the call to `select(..)` does not trigger a job.

That's because `select(..)` is a transformation. It's just one more step in a long list of transformations.

Let's go ahead and invoke the action `show(..)` and take a look at the result.

In [0]:
# And lastly, show the first five records which should exclude the bytes_served column.
onlyThreeDF.show(5, False)

+-------+----------------+--------+
|project|article         |requests|
+-------+----------------+--------+
|en     |!?Revolution!?  |1       |
|en     |!Ay,_caramba!   |1       |
|en     |!DOCTYPE        |1       |
|en     |!Gã!nge_language|1       |
|en     |!Hukwe_language |1       |
+-------+----------------+--------+
only showing top 5 rows



The `select(..)` command is one of the most powerful and most commonly used transformations. 

We will see plenty of other examples of its usage as we progress.

If you look at the API docs, `select(..)` is described like this:
> Returns a new Dataset by computing the given Column expression for each element.

The "Column expression" referred to there is where the true power of this operation shows up. Again, we will go deeper on these later.

Just like `limit(..)`, `select(..)` 
* does not trigger a job
* returns a new `DataFrame`
* simply defines the next transformation in a sequence of transformations.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) drop(..)

As a side note, you will quickly discover that there are many ways to accomplish the same task.

Take the transformation `drop(..)` for example - instead of selecting everything we wanted, `drop(..)` allows us to specify the columns we don't want.

If you look at the API docs, `drop(..)` is described like this:
> Returns a new Dataset with a column dropped.

And we can see that we can produce the same result as the last exercise this way:

In [0]:
pagecountsEnAllDF.display()

# OR
# display(pagecountsEnAllDF)

project,article,requests,bytes_served
en,!?Revolution!?,1,0
en,"!Ay,_caramba!",1,0
en,!DOCTYPE,1,0
en,!Gã!nge_language,1,0
en,!Hukwe_language,1,0
en,!Kung_San,1,0
en,!O!kung_language,1,0
en,!Ora_language,1,0
en,!T.O.O.H!,1,0
en,!Tre!,1,0


In [0]:
# Transform the data by dropping the one column we don't want
droppedDF = (pagecountsEnAllDF
  .drop("bytes_served") # Our second transformation after the initial read (4 columns down to 3)
)
# Now let's take a look at what the schema looks like
droppedDF.printSchema()

root
 |-- project: string (nullable = true)
 |-- article: string (nullable = true)
 |-- requests: integer (nullable = true)



Again, `drop(..)` is just one more transformation - that is, no job is triggered.
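
That the two approaches arrive at the same columns can be checked with a small plain-Python model. This is a sketch of the semantics only: the `select` and `drop` functions below are hypothetical stand-ins that operate on lists of dicts, not the Spark methods. One difference worth modelling is that Spark documents `drop(..)` as a no-op for a column name that is not in the schema, whereas selecting an unknown column is an error.

```python
rows = [
    {"project": "en", "article": "!DOCTYPE", "requests": 1, "bytes_served": 0},
    {"project": "en", "article": "!Kung_San", "requests": 1, "bytes_served": 0},
]

def select(rows, *columns):
    # Keep only the named columns; an unknown name raises KeyError
    return [{c: row[c] for c in columns} for row in rows]

def drop(rows, *columns):
    # Keep everything except the named columns; unknown names are ignored
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]

selected = select(rows, "project", "article", "requests")
dropped = drop(rows, "bytes_served")
same_result = selected == dropped
```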

In [0]:
# And lastly, show the first five records which should exclude the bytes_served column.
droppedDF.show(5, False)

+-------+----------------+--------+
|project|article         |requests|
+-------+----------------+--------+
|en     |!?Revolution!?  |1       |
|en     |!Ay,_caramba!   |1       |
|en     |!DOCTYPE        |1       |
|en     |!Gã!nge_language|1       |
|en     |!Hukwe_language |1       |
+-------+----------------+--------+
only showing top 5 rows



##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) distinct() & dropDuplicates()

These two transformations do the same thing. In fact, they are aliases for one another.
* You can see this by looking at the source code for these two methods
* ```def distinct(): Dataset[T] = dropDuplicates()```
* See <a href="https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala" target="_blank">Dataset.scala</a>

The difference between them has everything to do with the programmer and their perspective.
* The name **distinct** will resonate with developers, analysts and DB admins with a background in SQL.
* The name **dropDuplicates** will resonate with developers who have a background or experience in functional programming.

As you become more familiar with the various APIs, you will see this pattern reassert itself.

The designers of the API are trying to make the API as approachable as possible for multiple target audiences.

If you look at the API docs, both `distinct(..)` and `dropDuplicates(..)` are described like this:
> Returns a new Dataset that contains only the unique rows from this Dataset....

With this transformation, we can now tackle our first business question:

### How many different English Wikimedia projects saw traffic during that hour?

If you recall, our original `DataFrame` has this schema:

In [0]:
pagecountsEnAllDF.printSchema()

root
 |-- project: string (nullable = true)
 |-- article: string (nullable = true)
 |-- requests: integer (nullable = true)
 |-- bytes_served: long (nullable = true)



The transformation `distinct()` is applied to the row as a whole: data in every column (**project**, **article**, **requests** and **bytes_served**) affects this evaluation.

To get the distinct list of projects, and only projects, we need to reduce the number of columns to just the one column, **project**. 

We can do this with the `select(..)` transformation and then we can introduce the `distinct()` transformation.
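
Why the column reduction has to come first can be seen with a handful of rows in plain Python. The rows below are made up for illustration (including one exact duplicate), and sets stand in for `distinct()`; this models the semantics only and is not Spark code.

```python
rows = [
    ("en",   "!DOCTYPE",  1, 0),
    ("en",   "!Kung_San", 1, 0),
    ("en.m", "!DOCTYPE",  1, 0),
    ("en",   "!DOCTYPE",  1, 0),   # exact duplicate of the first row
]

# "distinct" over whole rows: only the exact duplicate disappears
distinct_rows = set(rows)

# "select project" first, then "distinct": one entry per project
distinct_projects = {project for project, _article, _requests, _bytes in rows}
```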

In [0]:
# Roughly equivalent to pandas' unique(): collect the distinct project values into a Python list

from pyspark.sql import functions as F
pagecountsEnAllDF.agg(F.collect_set("project")).collect()[0][0]

Out[86]: ['en.m.q',
 'en.zero.s',
 'en.m.voy',
 'en.m.s',
 'en.n',
 'en.m',
 'en.b',
 'en.m.v',
 'en.s',
 'en.q',
 'en.zero',
 'en.voy',
 'en.zero.n',
 'en.v',
 'en.zero.v',
 'en.m.b',
 'en.m.d',
 'en.m.n',
 'en.zero.voy',
 'en.zero.d',
 'en',
 'en.d',
 'en.zero.b',
 'en.zero.q']

Just to reinforce, we have three transformations:
0. Read the data (now represented by `pagecountsEnAllDF`)
0. Select just the one column
0. Reduce the records to a distinct set

No job is triggered until we perform an action like `show(..)`:

In [0]:
pagecountsEnAllDF.select("project").distinct().show(3)  


# OR
# pagecountsEnAllDF.createOrReplaceTempView('df')
# spark.sql("""SELECT distinct project FROM df """).show(3)

+-------+
|project|
+-------+
|     en|
|   en.m|
|   en.d|
+-------+
only showing top 3 rows



You can count those if you like.

But it would be easier to ask the `DataFrame` for the `count()`:

In [0]:
distinctDF = (pagecountsEnAllDF     # Our original DataFrame from spark.read.parquet(..)
  .select("project")                # Drop all columns except the "project" column
  .distinct()                       # Reduce the set of all records to just the distinct values.
)

total = distinctDF.count()     
print("Distinct Projects: {0:,}".format( total ))

Distinct Projects: 24


In [0]:
# Alternative
from pyspark.sql.functions import col, countDistinct
pagecountsEnAllDF.select(countDistinct("project")).show()

+-----------------------+
|count(DISTINCT project)|
+-----------------------+
|                     24|
+-----------------------+



In [0]:
# Alternative
expression = [countDistinct(c).alias(c) for c in pagecountsEnAllDF.columns]
pagecountsEnAllDF.select(*expression).show()

+-------+-------+--------+------------+
|project|article|requests|bytes_served|
+-------+-------+--------+------------+
|     24|1783138|    1010|           1|
+-------+-------+--------+------------+



##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) dropDuplicates(columns...)

The method `dropDuplicates(..)` has a second variant that accepts one or more columns.
* Unlike `distinct()` or the no-argument `dropDuplicates()`, the comparison is not performed across the entire record.
* The distinction is based only on the specified columns.
* This allows us to keep all the original columns in our `DataFrame`.
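
In PySpark this variant is called with a list of column names, for example `pagecountsEnAllDF.dropDuplicates(["project"])`. The plain-Python sketch below models what that means for the rows. It rests on assumptions: `drop_duplicates` is a hypothetical helper, the rows are made up, and the model keeps the first row it sees for each key, whereas Spark does not promise which of the duplicate rows survives.

```python
def drop_duplicates(rows, subset):
    """Keep one full row per distinct combination of the `subset` columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[column] for column in subset)  # compare only these columns
        if key not in seen:
            seen.add(key)
            kept.append(row)                            # the whole row is kept
    return kept


rows = [
    {"project": "en",   "article": "!DOCTYPE",  "requests": 1},
    {"project": "en",   "article": "!Kung_San", "requests": 1},
    {"project": "en.m", "article": "!DOCTYPE",  "requests": 1},
]

one_per_project = drop_duplicates(rows, ["project"])
```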

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Recap

Our code is spread out over many cells which can make this a little hard to follow.

Let's take a look at the same code in a single cell.

In [0]:
parquetDir = source + "/wikipedia/pagecounts/staging_parquet_en_only_clean/"

In [0]:
pagecountsEnAllDF = (spark       # Our SparkSession & Entry Point
  .read                          # Our DataFrameReader
  .parquet(parquetDir)           # Returns an instance of DataFrame
)
(pagecountsEnAllDF               # Only if we are running multiple queries
  .cache()                       # mark the DataFrame as cacheable
  .count()                       # materialize the cache
)
distinctDF = (pagecountsEnAllDF  # Our original DataFrame from spark.read.parquet(..)
  .select("project")             # Drop all columns except the "project" column
  .distinct()                    # Reduce the set of all records to just the distinct values.
)
total = distinctDF.count()     
print("Distinct Projects: {0:,}".format( total ))

Distinct Projects: 24


##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) DataFrames vs SQL & Temporary Views

The `DataFrame` API is built upon a SQL engine.

As such we can "convert" a `DataFrame` into a temporary view (or table) and then use it in "standard" SQL.

Let's start by creating a temporary view from a previous `DataFrame`.

In [0]:
pagecountsEnAllDF.createOrReplaceTempView("pagecounts")

Now that we have a temporary view (or table) we can start expressing our queries and transformations in SQL:

In [0]:
%sql

SELECT * FROM pagecounts

project,article,requests,bytes_served
en,!?Revolution!?,1,0
en,"!Ay,_caramba!",1,0
en,!DOCTYPE,1,0
en,!Gã!nge_language,1,0
en,!Hukwe_language,1,0
en,!Kung_San,1,0
en,!O!kung_language,1,0
en,!Ora_language,1,0
en,!T.O.O.H!,1,0
en,!Tre!,1,0


We can just as easily express the distinct list of projects in SQL and, just because we can, sort that list:

In [0]:
%sql

SELECT DISTINCT project FROM pagecounts ORDER BY project

project
en
en.b
en.d
en.m
en.m.b
en.m.d
en.m.n
en.m.q
en.m.s
en.m.v


And converting from SQL back to a `DataFrame` is just as easy:

In [0]:
tableDF = spark.sql("SELECT DISTINCT project FROM pagecounts ORDER BY project")
display(tableDF)

project
en
en.b
en.d
en.m
en.m.b
en.m.d
en.m.n
en.m.q
en.m.s
en.m.v


## Next steps

Start the next lesson, [Exercise: Distinct Articles]($./4.Exercise:%20Distinct%20Articles)