----------
## Notebook setup

If this notebook is using the "PySpark" kernel and you have set up Livy using SSH, you can now access the cluster.

Every time you run a cell, your web browser window title shows a **(Busy)** status next to the notebook title. You will also see a solid circle next to the **PySpark** text in the top-right corner. When the job completes, it changes to a hollow circle.

In [2]:
print("Running a simple command to start connection to spark")

The code failed because of a fatal error:
	Error sending http request and maximum retry encountered..

Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.


### Session information (%%info)

Livy is an open-source REST server for Spark. When you execute a code cell in a PySpark notebook, the kernel creates a Livy session to execute your code. You can use the `%%info` magic to display the current Livy session information. Magic commands start with `%%`.

In [4]:
%%info

The code failed because of a fatal error:
	Failed to register auto viz for notebook.
Exception details:
	"cannot import name 'DataError' from 'pandas.core.groupby' (/opt/homebrew/anaconda3/envs/sparkmagicEnv/lib/python3.8/site-packages/pandas/core/groupby/__init__.py)".

Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.
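
When `%%info` works, it lists the sessions that Livy knows about. As a rough illustration of the data behind that listing, the sketch below parses a response shaped like the one Livy's REST endpoint `GET /sessions` returns. The field names (`id`, `kind`, `state`, `appId`) follow the Livy REST API as far as we know it; the sample values and the helper `summarize_sessions` are made up for illustration, and nothing here contacts a server.

```python
import json

# Hypothetical sample of a Livy GET /sessions response (values are made up).
SAMPLE_RESPONSE = """
{"from": 0, "total": 2, "sessions": [
  {"id": 3, "appId": "application_0000000000000_0001", "kind": "pyspark", "state": "idle"},
  {"id": 4, "appId": null, "kind": "pyspark", "state": "starting"}
]}
"""


def summarize_sessions(payload):
    """Return one (id, kind, state, appId) tuple per session in the payload."""
    data = json.loads(payload)
    return [
        (s["id"], s.get("kind"), s.get("state"), s.get("appId"))
        for s in data.get("sessions", [])
    ]


for row in summarize_sessions(SAMPLE_RESPONSE):
    print("id={} kind={} state={} appId={}".format(*row))
```

A session that is still starting may not have a YARN application ID yet, which is why `appId` is read with `.get()` and can be `None`.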


Use `%%help` to show all available "sparkmagic" commands:

In [5]:
%%help

The code failed because of a fatal error:
	Failed to register auto viz for notebook.
Exception details:
	"cannot import name 'DataError' from 'pandas.core.groupby' (/opt/homebrew/anaconda3/envs/sparkmagicEnv/lib/python3.8/site-packages/pandas/core/groupby/__init__.py)".

Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.


----------
## PySpark magics 

The PySpark kernel provides some predefined “magics”, which are special commands that you can call with `%%` (e.g. `%%MAGIC <args>`). A magic command must be the first word in a code cell, and it can be followed by multiple lines of content. You can’t put comments before a cell magic.

For more information on magics, see the [IPython documentation on magics](http://ipython.readthedocs.org/en/stable/interactive/magics.html).

The SparkSession is available as the variable `spark`:

In [6]:
spark

The code failed because of a fatal error:
	Failed to register auto viz for notebook.
Exception details:
	"cannot import name 'DataError' from 'pandas.core.groupby' (/opt/homebrew/anaconda3/envs/sparkmagicEnv/lib/python3.8/site-packages/pandas/core/groupby/__init__.py)".

Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.


Let's investigate its type:

In [None]:
type(spark)

Normal Python code can be executed here:

In [None]:
for i in range(2):
    print(i)

We can also import packages:

In [None]:
import pandas as pd

Let's run a Spark function using the `spark` variable. Use the `%%pretty` magic command to display the DataFrame nicely.

In [None]:
%%pretty
# Read the csv-file from HDFS
df = spark.read\
    .option("header",True)\
    .csv("/datasets/retail/retail.csv")

# Show the first 20 rows as text
df.show()

Note that this DataFrame is not a pandas DataFrame!

In [None]:
type(df)

How many rows do we have access to?

In [None]:
print("{:,.0f} rows in df".format( df.count() ) )

Let's save resources by working on only a subset (10%) of the DataFrame while we develop our code. This makes calculations faster for you and for everyone else using the cluster.
Once we are sure our code works as intended, we can delete this sampling step. We use:

``DataFrame.sample(withReplacement=None, fraction=None, seed=None)``
See https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.sample.html

In [None]:
df = df.sample(withReplacement=False, fraction=1/10)

print("{:,.0f} rows in df".format( df.count() ) )
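
The row count after sampling is usually close to, but not exactly, 10% of the original, because `sample` does not guarantee the exact fraction. The sketch below mimics that behaviour locally in plain Python: it keeps each row independently with probability `fraction`. The helper `bernoulli_sample` is hypothetical and is not Spark's implementation; it only illustrates why the count varies and what a fixed seed buys you.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(100_000))  # stand-in for the rows of a DataFrame

a = bernoulli_sample(rows, fraction=0.1, seed=42)
b = bernoulli_sample(rows, fraction=0.1, seed=42)
c = bernoulli_sample(rows, fraction=0.1, seed=7)

print("{:,.0f} rows kept with seed 42".format(len(a)))
print("{:,.0f} rows kept with seed 7".format(len(c)))
print("same seed gives the same sample:", a == b)
```

In the notebook, passing a `seed` argument to `df.sample(...)` should similarly make the subset repeatable between runs, assuming the input data and its partitioning stay the same.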

### SQL magic (%%sql)

The PySpark kernel supports easy inline Spark SQL queries against the `sqlContext`, which you will need for parts of the assignment. If you are comfortable with SQL, you can create a temporary view on a DataFrame/Dataset with `createOrReplaceTempView()` and then use SQL to select and manipulate the data.

In [None]:
df.createOrReplaceTempView("invoices")

In [None]:
%%sql

SELECT InvoiceNo, SUM(Quantity) AS total_quantity 
FROM invoices 
GROUP BY InvoiceNo 
ORDER BY total_quantity DESC 
LIMIT 20

In [None]:
%%sql

SELECT DISTINCT(Country) 
FROM invoices
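
To check what these two queries compute without using cluster resources, the same SQL can be tried on a tiny in-memory table. The sketch below swaps Spark SQL for Python's built-in `sqlite3` module, so it is only an approximation of how Spark evaluates the queries; the five rows are invented for illustration and do not come from `retail.csv`.

```python
import sqlite3

# In-memory SQLite database standing in for the "invoices" temporary view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (InvoiceNo TEXT, Quantity INTEGER, Country TEXT)")

# Made-up rows, not taken from the real dataset.
sample_rows = [
    ("A001", 6, "United Kingdom"),
    ("A001", 4, "United Kingdom"),
    ("A002", 12, "Japan"),
    ("A003", 3, "France"),
    ("A003", 2, "France"),
]
conn.executemany("INSERT INTO invoices VALUES (?, ?, ?)", sample_rows)

# Total quantity per invoice, largest first.
totals = conn.execute("""
    SELECT InvoiceNo, SUM(Quantity) AS total_quantity
    FROM invoices
    GROUP BY InvoiceNo
    ORDER BY total_quantity DESC
    LIMIT 20
""").fetchall()

# One row per distinct country (sorted here only to make the output stable).
countries = conn.execute(
    "SELECT DISTINCT Country FROM invoices ORDER BY Country"
).fetchall()

print(totals)
print(countries)
conn.close()
```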

Using PySpark, answer the following questions (but first see the notes below):  
1. What is the average UnitPrice in retail.csv?  
1. What is the data type of each column (schema) in this file?  
1. How do you find the number of unique countries using DataFrames instead of Spark SQL?
1. How many invoices are from Japan?

In [None]:
# Your code here.

----
### Optional: Making code run faster: Session configuration (%%configure)

**NOTE: Many students use the cluster's resources. Please be mindful of how many resources you use, so that there is enough for everyone!**

All students can view which applications are running on the cluster, and how many resources people use at the <a href="http://130.226.142.166:8088/cluster/scheduler" target="_blank"> cluster scheduler overview</a>. When connecting through "Livy" you cannot see the username of the student.

Use the `%%configure` magic to configure new or existing Livy sessions.
* If a session is already running, you can change the configuration by using the `-f` argument with `%%configure` magic. This will delete the current session and recreate it with the applied configurations. If you don't provide the `-f` argument, an error will be displayed and no configuration changes will be applied.
* If you haven't already started the session, then the `-f` argument is not mandatory. Even if you use it with a session that you are just creating, it will not delete any currently running sessions.




These are some session attributes that can be used for configuration:
- **"name"**: Name of the application
- **"driverMemory"**: Memory for driver (e.g. 1000M, 2G) 
- **"executorMemory"**: Memory for executor (e.g. 1000M, 2G) 

For more attributes for session configuration see <a href="https://github.com/cloudera/livy/tree/6fe1e80cfc72327c28107e0de20c818c1f13e027#post-sessions" target="_blank"> the Livy documentation</a>.

> **TIP**: The application name should start with `remotesparkmagics` so that sessions are cleaned up automatically if an error occurs. A name that does not start with `remotesparkmagics` will not cause an error, but the cleanup won't happen.


By default, the PySpark shell is allocated a modest amount of resources on the cluster, but you can change this through the following options.

    --name NAME                 A name of your application.
    --conf PROP=VALUE           Arbitrary Spark configuration property.
    --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
    --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
    --driver-cores NUM          Number of cores used by the driver, only in cluster mode (Default: 1).
    --executor-cores NUM        Number of cores per executor. (Default: 1)
    --num-executors NUM         Number of executors to launch (Default: 2).

For example, if you want a PySpark session with 4 executors, each with 4 gigabytes of memory, you would write:

In [None]:
%%configure -f 
{"name":"remotesparkmagics-sample", "executorMemory": "4G", "numExecutors":4}
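
Because the body of a `%%configure` cell is plain JSON, it can be convenient to generate it rather than type it by hand. The sketch below is one possible way to do that: the helper `build_configure_cell` is hypothetical (it is not part of sparkmagic), it always prefixes the name with `remotesparkmagics-` as the tip above recommends, and it only uses the attributes named in this notebook (`name`, `driverMemory`, `executorMemory`, `numExecutors`).

```python
import json


def build_configure_cell(name_suffix, executor_memory="1G", num_executors=2,
                         driver_memory=None, force=True):
    """Return the text of a %%configure cell (hypothetical helper)."""
    config = {
        # Prefix keeps the session eligible for automatic cleanup.
        "name": "remotesparkmagics-" + name_suffix,
        "executorMemory": executor_memory,
        "numExecutors": num_executors,
    }
    if driver_memory is not None:
        config["driverMemory"] = driver_memory
    # -f recreates an already running session with the new settings.
    magic = "%%configure -f" if force else "%%configure"
    return magic + "\n" + json.dumps(config)


cell = build_configure_cell("sample", executor_memory="4G", num_executors=4)
print(cell)
```

The printed text can be pasted into a new cell. Keeping the values small by default is deliberate, since the cluster is shared.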

Read more about sparkmagic here in the original notebook:
https://github.com/jupyter-incubator/sparkmagic/blob/master/examples/Pyspark%20Kernel.ipynb