![Databricks Logo](http://training.databricks.com/databricks_guide/databricks_logo_400px.png)
# **Introduction to Python Notebooks** 

* This introduction notebook describes how to get started running Python code in Notebooks.
* If you have not already done so, please review the [Welcome to Databricks](/#workspace/databricks_guide/00 Welcome to Databricks) guide.

### **Clone** this notebook
This is a locked notebook so you will need to clone it into your workspace to use it.  
* Click the **Clone** button in the menu bar on the top left. 
* Navigate to the cloned notebook (e.g. click Menu > Workspace > ``Intro Python Notebooks copy``)

![Menu Bar Clone Notebook](http://training.databricks.com/databricks_guide/unlock_nb_clone_menu.png)

### **Attach** the Notebook to a **cluster**
* A **Cluster** is a group of machines which can run commands in cells.
* Check the upper left corner of your notebook to see if it is **Attached** or **Detached**.
* If **Detached**, select a cluster to attach it to. If you don't currently have clusters, create one as described in the [Welcome to Databricks](/#workspace/databricks_guide/00 Welcome to Databricks) guide.

![Attach Notebook](http://training.databricks.com/databricks_guide/cluster_attach2.png)

***
#### ![Quick Note](http://training.databricks.com/databricks_guide/icon_note3_s.png) **Cells** are units that make up notebooks
![A Cell](http://training.databricks.com/databricks_guide/cell.png)

Cells each have a type - either **scala**, **python**, **sql**, **R**, or **markdown**.
* While cells default to the type of the Notebook, other cell types are supported as well.
* For example, Python Notebooks can contain python, sql, or markdown cells but not scala cells.
* This cell is in **markdown** and is used for documentation. [Markdown](http://en.wikipedia.org/wiki/Markdown) is a simple text formatting syntax.
***

### **Create** and **Run** a New Markdown Cell in this Notebook
* When you mouse between cells, a + sign will pop up in the center that you can click on to create a new cell.

 ![New Cell](http://training.databricks.com/databricks_guide/create_new_cell.png)
* Type **``%md Hello, world!``** into your new cell (**``%md``** indicates the cell is markdown).



* Press **Shift+Enter** when in the cell to **run** it and proceed to the next cell.
  * The cell's contents should update.
  
  ![Run cell](http://training.databricks.com/databricks_guide/run_cell.png)

Jules, Welcome to the DB Notebook cloud

Hello, world!

***
#### ![Quick Note](http://training.databricks.com/databricks_guide/icon_note3_s.png) **Markdown Cell Tips**
* To change a non-markdown cell to markdown, add **%md** to the very start of the cell.
* After editing a markdown cell, run it again to update its formatted contents.
* Cells are not automatically run each time you open the notebook.
  * Instead, previous results from running a cell are saved and displayed.
* Alternatively, press **Ctrl+Enter** when in a cell to **run** it without proceeding to the next cell.
***

### Run a **Python Cell**
* Run the following python cell.
* Note: No special indicator (such as %md) is needed to create a Python cell in a Python notebook.
* Make sure the output date and time updates before moving on.

In [10]:
# This is a Python cell.
import datetime
print("This was last run on: %s" % datetime.datetime.now())
print("Hello Jules, Welcome to the Databricks Cloud")

### Running **Spark**
The variable **sc** allows you to access a Spark Context to run your Spark programs.
* For more information about Spark, please refer to [Spark Overview](https://spark.apache.org/docs/latest/)

In [12]:
words = sc.parallelize(["hello", "world", "goodbye", "hello", "again", "cats", "dogs", "rats", "snakes", "python", "java", "python", "go"])
wordcounts = words.map(lambda s: (s, 1)).reduceByKey(lambda a, b : a + b).collect()
print(wordcounts)
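
To make the `map` / `reduceByKey` pipeline above concrete, here is a minimal plain-Python sketch of the same computation. It does not use Spark, and `count_words` is a hypothetical helper name introduced only for illustration. It should yield the same counts as the `collect()` result above, though possibly in a different order.

```python
from functools import reduce


def count_words(words):
    """Plain-Python analogue of words.map(lambda s: (s, 1)).reduceByKey(a + b)."""
    # "map" step: pair every word with the count 1.
    pairs = [(word, 1) for word in words]

    # "reduceByKey" step: add up the counts of pairs that share a key.
    def merge(counts, pair):
        key, value = pair
        counts[key] = counts.get(key, 0) + value
        return counts

    return reduce(merge, pairs, {})


sample_words = ["hello", "world", "goodbye", "hello", "again", "cats", "dogs",
                "rats", "snakes", "python", "java", "python", "go"]
local_counts = count_words(sample_words)
print(sorted(local_counts.items()))
```

Spark performs the merge step per key across partitions, which is why the function passed to `reduceByKey` only ever sees two values at a time.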

In [13]:
# Exercise: Calculate the number of unique words in the "words" RDD here.
# (Hint: The answer should be 11.)
unique = words.distinct().collect()
print(unique)
print(len(unique))

In [14]:
# Exercise: Create an RDD of numbers, and use Spark to find the mean.
numbersRDD = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Not the most efficient approach, since it invokes two actions: sum and count.
# RDDs also provide a built-in mean() action.
total = numbersRDD.sum()
count = numbersRDD.count()
# Convert to float before dividing so that Python 2 does not truncate the result.
print("%s %s %s" % (total, count, float(total) / count))
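
The cell above triggers two Spark actions, `sum` and `count`. One way to get both values in a single pass is to carry a running `(total, count)` pair. The sketch below shows that idea in plain Python, simulating two partitions locally. The names `seq_op` and `comb_op` and the partition split are assumptions made for illustration.

```python
from functools import reduce


def seq_op(acc, value):
    """Fold one number into a running (total, count) pair."""
    return (acc[0] + value, acc[1] + 1)


def comb_op(left, right):
    """Merge the (total, count) pairs produced by two partitions."""
    return (left[0] + right[0], left[1] + right[1])


# Simulate two partitions locally; Spark would choose the split itself.
partitions = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
partials = [reduce(seq_op, part, (0, 0)) for part in partitions]
total, count = reduce(comb_op, partials, (0, 0))

# Convert to float first so integer division cannot truncate the mean.
mean = float(total) / count
print("total=%s count=%s mean=%s" % (total, count, mean))
```

On a cluster, the same two functions could be passed to `numbersRDD.aggregate((0, 0), seq_op, comb_op)`, which visits the data once.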

&nbsp;

### Working with **DataFrames**
* Python can be used to create Spark [DataFrames](http://spark.apache.org/docs/latest/sql-programming-guide.html) - a distributed collection of data organized into named columns.
* One way to create a DataFrame is to call ``createDataFrame`` on an RDD of ``pyspark.sql.Row`` objects.

In [16]:
# Reference pyspark.sql.Row object
from pyspark.sql import Row

# Build the array of Row objects
array = [Row(key="a", group="vowels", value=105),
         Row(key="b", group="consonants", value=25),
         Row(key="c", group="consonants", value=30),
         Row(key="d", group="consonants", value=45),
         Row(key="e", group="vowels", value=55)]
# Build the list of Row objects describing people
people = [Row(key=1, age=32, name="Jules Damji", city="San Francisco", occupation="advocate"),
          Row(key=2, age=42, name="Arthur Dent", city="London", occupation="Hitch-hiker"),
          Row(key=3, age=2009, name="Zaphod Beeblebrox", city="Milkyway", occupation="Wiseman"),
          Row(key=4, age=55, name="Ford Prefect", city="Essex", occupation="Realtor"),
          Row(key=5, age=2005, name="Trillian", city="Andromeda", occupation="Space Traveler"),
          Row(key=6, age=1000, name="Slartibarfast", city="M33", occupation="Creator"),
          Row(key=7, age=4500, name="Random Dent", city="X33", occupation="Invisible Hand")
]

# Create an RDD using sc.parallelize and then transform it into a DataFrame
df = sqlContext.createDataFrame(sc.parallelize(array))
dfp = sqlContext.createDataFrame(sc.parallelize(people))

#### **Show this data**
Use the **``display``** command to view a DataFrame in a notebook.

In [18]:
display(df)

***
#### ![Quick Note](http://training.databricks.com/databricks_guide/icon_note3_s.png) **The visualization above is interactive**
* Click on the **Chart Button** ![Chart Button](http://training.databricks.com/databricks_guide/chart_button.png) to toggle the view.
* Try different types of graphs by clicking on the arrow next to the chart button.
* If you have selected a graph, you can click on **Plot Options...** for even more ways to customize the view.
***

In [20]:
# Exercise: Create a DataFrame and display it. 
# Can you use the "Plot Options" to plot the group vs. the sum of the values in that group?
# (Hint: vowels = 160, and consonants = 100)
display(df)
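
As a cross-check for the plot in the exercise above, the per-group totals can be computed locally. The sketch below uses plain Python tuples that mirror the `Row` objects used to build `df`, so it runs without a cluster. `group_totals` is a hypothetical helper introduced for illustration.

```python
from collections import defaultdict

# (key, group, value) triples mirroring the Row objects used to build df.
rows = [
    ("a", "vowels", 105),
    ("b", "consonants", 25),
    ("c", "consonants", 30),
    ("d", "consonants", 45),
    ("e", "vowels", 55),
]


def group_totals(records):
    """Sum the value field of each record, keyed by its group field."""
    totals = defaultdict(int)
    for _key, group, value in records:
        totals[group] += value
    return dict(totals)


totals = group_totals(rows)
print(sorted(totals.items()))
```

On the DataFrame itself, `df.groupBy("group").sum("value")` should express the same aggregation.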


In [21]:
display(dfp)

***
## ![Quick Note](http://training.databricks.com/databricks_guide/icon_note3_s.png) **Where to Go Next**

We've now covered the basics of a Databricks Python Notebook. For more Python practice, try the optional tasks below. Otherwise, continue with the notebooks in other languages or go to the Databricks Product Overview.
* [Notebook Tutorials in Scala](/#workspace/databricks_guide/01 Intro Notebooks/2 Intro Scala Notebooks)
* [Notebook Tutorials in SQL](/#workspace/databricks_guide/01 Intro Notebooks/3 Intro SQL Notebooks)
* [Notebook Tutorials in R](/#workspace/databricks_guide/01 Intro Notebooks/4 Intro R Notebooks)
* [Databricks Product Overview](/#workspace/databricks_guide/02 Product Overview/00 Product Overview)
***

## **Optional Tasks**

### **Importing Standard Python libraries**
* Standard Python libraries, such as ``re`` in the cell below, can be imported directly.
* Libraries that are not available by default can be uploaded to the Workspace.
* Refer to the **[Libraries](/#workspace/databricks_guide/02 Product Overview/07 Libraries)** guide for more details.

In [25]:
import re
m = re.search('(?<=abc)def', 'abcdef')
m.group(0)
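
The pattern in the cell above uses a positive lookbehind, `(?<=abc)`. With it, `def` matches only when it is immediately preceded by `abc`, and `abc` itself is not part of the match. Here is a small sketch of that behaviour using only the standard `re` module. The second sample string is made up for illustration.

```python
import re

pattern = re.compile(r"(?<=abc)def")

# Preceded by "abc": matches, and the match excludes the lookbehind text.
hit = pattern.search("abcdef")
print(hit.group(0), hit.start())

# Not preceded by "abc": no match, so search() returns None.
miss = pattern.search("xyzdef")
print(miss)
```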

### **Using Spark SQL within a Python Notebook**
You can execute SQL commands within a Python notebook by invoking **``%sql``**.

In [27]:
%sql show databases

#### **Use ``registerTempTable`` on a DataFrame** 
* Register the DataFrame built above in Python as a temporary table for SQL queries.
* Temporary tables are not meant to be persistent, i.e. they will not survive cluster restarts.

In [29]:
df.registerTempTable("PythonTempTable")

In [30]:
%sql describe PythonTempTable


In [31]:
dfp.registerTempTable("PeopleTable")

#### **Visualizations and Spark SQL**
* A visualization appears automatically for the output of a **SQL select** statement in notebooks (no need to call ``display``).

In [33]:
%sql select * from PythonTempTable

&nbsp;

#### **Persist DataFrames into Tables**
Use **``saveAsTable``** to persist tables to be used in other notebooks.
* These table definitions will persist even after cluster restarts.

In [35]:

df.write.saveAsTable("PythonTestTable")

***
![Quick Note](http://training.databricks.com/databricks_guide/icon_note3_s.png) For more information on working with Python and Spark SQL, please refer to [Spark SQL Programming Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)
***

### **Display HTML**
Display HTML within your notebook, using the **displayHTML** command.

In [38]:
displayHTML("<h3 style=\"color:blue\">Blue Text</h3>")
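
Since `displayHTML` is called with a string in the cell above, the markup can be assembled in Python first. The sketch below builds a small HTML table and escapes the cell text. `rows_to_html` is a hypothetical helper, not part of Databricks, and the last sample row is made up to show escaping. The final `displayHTML` call is left as a comment so the snippet also runs outside a notebook.

```python
from xml.sax.saxutils import escape


def rows_to_html(header, rows):
    """Render a header and rows of cell values as an HTML table string."""
    head = "".join("<th>%s</th>" % escape(str(cell)) for cell in header)
    body = "".join(
        "<tr>%s</tr>" % "".join("<td>%s</td>" % escape(str(cell)) for cell in row)
        for row in rows
    )
    return "<table><tr>%s</tr>%s</table>" % (head, body)


html = rows_to_html(
    ["key", "group", "value"],
    [["a", "vowels", 105], ["b", "consonants", 25], ["x < y", "made-up", 0]],
)
print(html)
# In a Databricks notebook: displayHTML(html)
```

Escaping the cell text keeps characters such as `<` and `&` from being interpreted as markup.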