# Tutorial - Spark in Python
This notebook introduces some basic concepts and helps you get familiar with using Spark in Python.

In this notebook, we will load and explore the mtcars dataset. Specifically, this tutorial covers:

1. Loading data in memory
1. Creating SQLContext
1. Creating Spark DataFrame
1. Grouping data by columns
1. Operating on columns
1. Running SQL Queries from a Spark DataFrame


## Loading in a DataFrame
To create a Spark DataFrame, we first load an external dataset, `mtcars`, into a local pandas DataFrame. It contains 32 observations of 11 variables:

[, 1]	mpg -->	Miles/(US) gallon  
[, 2]	cyl -->	Number of cylinders  
[, 3]	disp -->	Displacement (cu.in.)  
[, 4]	hp -->	Gross horsepower  
[, 5]	drat -->	Rear axle ratio  
[, 6]	wt -->	Weight (lb/1000)  
[, 7]	qsec -->	1/4 mile time  
[, 8]	vs -->	Engine shape (0 = V-shaped, 1 = straight)  
[, 9]	am -->	Transmission (0 = automatic, 1 = manual)  
[,10]	gear -->	Number of forward gears  
[,11]	carb -->	Number of carburetors  

In [None]:
import pandas as pd
mtcars = pd.read_csv('https://ibm.box.com/shared/static/f1dhhjnzjwxmy2c1ys2whvrgz05d1pui.csv')

In [None]:
mtcars.head()

## Initialize SQLContext
To work with DataFrames, we need a SQLContext, which is created by calling `SQLContext(sc)`. It wraps the existing SparkContext, which this notebook environment has already created as `sc`.

In [None]:
from pyspark.sql import SQLContext  # needed unless the environment pre-imports it
sqlContext = SQLContext(sc)

## Creating Spark DataFrames
With SQLContext and a loaded local DataFrame, we create a Spark DataFrame:

In [None]:
sdf = sqlContext.createDataFrame(mtcars) 
sdf.printSchema()

## Displaying the contents of the DataFrame
`show(5)` prints the first five rows in tabular form.

In [None]:
sdf.show(5)

## Selecting columns

In [None]:
sdf.select('mpg').show(5)

## Filtering Data
Filter the DataFrame to retain only rows with `mpg` less than 18:

In [None]:
sdf.filter(sdf['mpg'] < 18).show(5)

## Operating on Columns
Spark DataFrames also provide a number of functions that can be applied directly to columns for data processing and aggregation. The example below uses basic arithmetic to convert the `wt` column, recorded in units of 1000 lb, to approximate metric tons (1000 lb is about 0.45 metric tons).

In [None]:
sdf.withColumn('wtTon', sdf['wt'] * 0.45).show(6)

In [None]:
# withColumn returns a new DataFrame; sdf itself is unchanged and has no wtTon column
sdf.show(6)
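
The factor 0.45 above can be checked with plain arithmetic, independent of Spark. The sketch below assumes `wt` is in units of 1000 lb, as the variable list states; the helper name `wt_to_metric_tons` and the sample value 2.0 are illustrative and not part of the notebook.

```python
# Exact definitions: 1 lb = 0.45359237 kg, and 1 metric ton = 1000 kg.
KG_PER_LB = 0.45359237
KG_PER_METRIC_TON = 1000.0


def wt_to_metric_tons(wt_thousand_lb):
    """Convert a weight given in units of 1000 lb to metric tons."""
    pounds = wt_thousand_lb * 1000.0
    return pounds * KG_PER_LB / KG_PER_METRIC_TON


# A hypothetical car with wt = 2.0, i.e. 2000 lb.
exact = wt_to_metric_tons(2.0)
approx = 2.0 * 0.45  # the rounded factor used in the tutorial cell
print(exact, approx)
```

The rounded factor differs from the exact one by less than one percent, which is fine for exploration; use 0.45359237 if precision matters.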

## Grouping, Aggregation
Spark DataFrames support a number of commonly used functions to aggregate data after grouping. For example, we can compute the average weight of cars for each cylinder count:

In [None]:
sdf.groupby(['cyl'])\
.agg({"wt": "AVG"})\
.show(5)

In [None]:
# We can also count the cars in each group and sort to find the most common cylinder counts
car_counts = sdf.groupby(['cyl'])\
.agg({"wt": "count"})\
.sort("count(wt)", ascending=False)
car_counts.show(5)  # show() returns None, so call it separately to keep car_counts a DataFrame
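
To make the semantics of the two aggregations concrete, here is a plain-Python sketch of what a grouped average and a grouped count sorted in descending order compute. It does not use Spark; the `(cyl, wt)` rows are made up for illustration and are not the real mtcars values, and `group_stats` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up (cyl, wt) rows for illustration only; not the real mtcars values.
rows = [(4, 2.0), (4, 3.0), (6, 3.0), (8, 4.0), (8, 3.5), (8, 4.5)]


def group_stats(pairs):
    """Return {cyl: (count, mean_wt)} for an iterable of (cyl, wt) pairs."""
    totals = defaultdict(lambda: [0, 0.0])  # cyl -> [count, sum of wt]
    for cyl, wt in pairs:
        totals[cyl][0] += 1
        totals[cyl][1] += wt
    return {cyl: (n, s / n) for cyl, (n, s) in totals.items()}


stats = group_stats(rows)

# Order the groups by count, largest first, mirroring a descending sort on the count column.
by_count = sorted(stats.items(), key=lambda item: item[1][0], reverse=True)
for cyl, (n, mean_wt) in by_count:
    print(cyl, n, mean_wt)
```

Spark performs the same computation, but distributes the per-group sums and counts across partitions before combining them.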


## Running SQL Queries from Spark DataFrames
A Spark DataFrame can also be registered as a temporary table in Spark SQL, which allows you to run SQL queries over its data. The `sql` function lets applications run SQL queries programmatically and returns the result as a DataFrame.



In [None]:
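# Register this DataFrame as a temporary table.
# (registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement.)
sdf.registerTempTable("cars")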

# SQL statements can be run by using the sql method
highgearcars = sqlContext.sql("SELECT gear FROM cars WHERE cyl >= 4 AND cyl <= 9")
highgearcars.show(6)
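
In mtcars every car has 4, 6, or 8 cylinders, so the `WHERE` clause above keeps every row. To see the filter exclude something, the same query text can be tried on a tiny table. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for Spark SQL, on the assumption that this simple `SELECT` behaves the same in both engines; the rows are made up, and the `ORDER BY` is added only to make the output deterministic.

```python
import sqlite3

# In-memory database standing in for the registered "cars" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (cyl INTEGER, gear INTEGER)")

# Made-up rows; the 12-cylinder car exists only so the filter has something to drop.
conn.executemany(
    "INSERT INTO cars VALUES (?, ?)",
    [(4, 4), (6, 3), (8, 5), (12, 6)],
)

result = conn.execute(
    "SELECT gear FROM cars WHERE cyl >= 4 AND cyl <= 9 ORDER BY gear"
).fetchall()
gears = [gear for (gear,) in result]
print(gears)
conn.close()
```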
    

NOTE: This tutorial draws heavily from the original 
[Spark Quick Start Guide](http://spark.apache.org/docs/latest/quick-start.html)

## Want to learn more?

### Free courses on [Big Data University](https://bigdatauniversity.com/courses/what-is-spark/?utm_source=tutorial-spark-python&utm_medium=dswb&utm_campaign=bdu):

<a href="https://bigdatauniversity.com/courses/what-is-spark/?utm_source=tutorial-spark-python&utm_medium=dswb&utm_campaign=bdu"><img src = "https://ibm.box.com/shared/static/pxb2n9airrzrfola21zj5ssj7kndcp8m.png"> </a>

<h3>Authors:</h3>
<br>
<a href="https://ca.linkedin.com/in/saeedaghabozorgi">
    <div class="teacher-image" style="    float: left;
        width: 115px;
        height: 115px;
        margin-right: 10px;
        margin-bottom: 10px;
        border: 1px solid #CCC;
        padding: 3px;
        border-radius: 3px;
        text-align: center;"><img class="alignnone wp-image-2258 " src="https://ibm.box.com/shared/static/tyd41rlrnmfrrk78jx521eb73fljwvv0.jpg" alt="Saeed Aghabozorgi" width="178" height="178"/>
    </div>
</a>

<h4>Saeed Aghabozorgi</h4>
<p><a href="https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a>, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients' ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods, such as machine learning and statistical modelling, on large datasets.</p>

<br>

<a href="https://ca.linkedin.com/in/polonglin">
    <div class="teacher-image" style="    float: left;
        width: 115px;
        height: 115px;
        margin-right: 10px;
        margin-bottom: 10px;
        border: 1px solid #CCC;
        padding: 3px;
        border-radius: 3px;
        text-align: center;"><img class="alignnone size-medium wp-image-2177" src="https://ibm.box.com/shared/static/2ygdi03ahcr97df2ofrr6cf8knq4kodd.jpg" alt="Polong Lin" width="300" height="300"/>
    </div>
</a>
<h4>Polong Lin</h4>
<p>
<a href="https://ca.linkedin.com/in/polonglin">Polong Lin</a> is a Data Scientist at IBM in Canada. Under the Emerging Technologies division, Polong is responsible for educating the next generation of data scientists through Big Data University. Polong is a regular speaker at conferences and meetups, and holds an M.Sc. in Cognitive Psychology.</p>

<hr>
Copyright &copy; 2016 [Big Data University](https://bigdatauniversity.com). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).