[Spark API Roadmap / Cheatsheet](https://zgul.de/teaching/spark-api)

In [2]:
#pyspark library
import pyspark

#create the SparkSession (required for creating spark dfs)
spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [3]:
#in a notebook, the session object renders as a nice html summary
spark

In [4]:
import pandas as pd
import pydataset

In [5]:
#loading tips dataset from pydataset
tips = pydataset.data('tips')

#spark is lazy: it doesn't process any data until an action requires it
df = spark.createDataFrame(tips)
df

DataFrame[total_bill: double, tip: double, sex: string, smoker: string, day: string, time: string, size: bigint]

In [6]:
#.show() is an action: it runs the plan and prints the first 20 rows by default
df.show()

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2|
|     10.27|1.71|  Male|    No|Sun|Dinner|   2|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|
|     15.42|1.57|  Male|    No|Sun|Dinner|   2|
|     18.43| 3.0|  Male|    No|Sun|Dinner|   4|
|     14.83|3.02|Female|    No|Sun|Dinner|   2|
|     21.58|3.92|  Male|    No|Sun|Dinner|   2|
|     10.33|1.67|Female|    No|Sun|Dinner|   3|
|     16.29|3.71|  Male|    No|Sun|Dinne

In [8]:
#common mistake: assigning the result of .show() to a variable
df2 = df.show(10)
df2

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2|
+----------+----+------+------+---+------+----+
only showing top 10 rows



In [9]:
#.show() works like a print statement: it prints and returns None
#so don't assign its result to a variable
#use .show() only to view contents
type(df2)

NoneType

In [10]:
#.head(n) returns a list of Row objects, so you can work with the values in python
df.head(5)

[Row(total_bill=16.99, tip=1.01, sex='Female', smoker='No', day='Sun', time='Dinner', size=2),
 Row(total_bill=10.34, tip=1.66, sex='Male', smoker='No', day='Sun', time='Dinner', size=3),
 Row(total_bill=21.01, tip=3.5, sex='Male', smoker='No', day='Sun', time='Dinner', size=3),
 Row(total_bill=23.68, tip=3.31, sex='Male', smoker='No', day='Sun', time='Dinner', size=2),
 Row(total_bill=24.59, tip=3.61, sex='Female', smoker='No', day='Sun', time='Dinner', size=4)]

In [12]:
#index into the list to get a single Row
df.head(5)[0]

Row(total_bill=16.99, tip=1.01, sex='Female', smoker='No', day='Sun', time='Dinner', size=2)

In [13]:
#show value for tip
df.head(5)[0].tip

1.01

In [15]:
#pull specific columns w/ .select and pass col names you want
df.select('total_bill', 'tip', 'size', 'day')
#this specifies a transformation, not an action, so no data is shown until you call .show()

DataFrame[total_bill: double, tip: double, size: bigint, day: string]

In [16]:
#show values
df.select('total_bill', 'tip', 'size', 'day').show()

+----------+----+----+---+
|total_bill| tip|size|day|
+----------+----+----+---+
|     16.99|1.01|   2|Sun|
|     10.34|1.66|   3|Sun|
|     21.01| 3.5|   3|Sun|
|     23.68|3.31|   2|Sun|
|     24.59|3.61|   4|Sun|
|     25.29|4.71|   4|Sun|
|      8.77| 2.0|   2|Sun|
|     26.88|3.12|   4|Sun|
|     15.04|1.96|   2|Sun|
|     14.78|3.23|   2|Sun|
|     10.27|1.71|   2|Sun|
|     35.26| 5.0|   4|Sun|
|     15.42|1.57|   2|Sun|
|     18.43| 3.0|   4|Sun|
|     14.83|3.02|   2|Sun|
|     21.58|3.92|   2|Sun|
|     10.33|1.67|   3|Sun|
|     16.29|3.71|   3|Sun|
|     16.97| 3.5|   3|Sun|
|     20.65|3.35|   3|Sat|
+----------+----+----+---+
only showing top 20 rows



In [17]:
#like sql, get back every col
df.select('*')

DataFrame[total_bill: double, tip: double, sex: string, smoker: string, day: string, time: string, size: bigint]

In [18]:
#referencing a col returns a spark Column object, not the data
df.tip

Column<'tip'>

In [19]:
#tip percentage
df.tip / df.total_bill
#syntax looks like pandas, but result is different

Column<'(tip / total_bill)'>

In [20]:
#pass the expression to .select() and call .show() to see results
df.select(df.tip / df.total_bill).show()

+-------------------+
| (tip / total_bill)|
+-------------------+
|0.05944673337257211|
|0.16054158607350097|
|0.16658733936220846|
| 0.1397804054054054|
|0.14680764538430255|
|0.18623962040332148|
|0.22805017103762829|
|0.11607142857142858|
|0.13031914893617022|
| 0.2185385656292287|
| 0.1665043816942551|
|0.14180374361883155|
|0.10181582360570687|
|0.16277807921866522|
|0.20364126770060686|
|0.18164967562557924|
| 0.1616650532429816|
|0.22774708410067526|
|0.20624631703005306|
|0.16222760290556903|
+-------------------+
only showing top 20 rows



In [21]:
#store the expression in a variable so it can be reused as a new col
col = df.tip / df.total_bill
col
#holds expression that represents new col, but no data

Column<'(tip / total_bill)'>

In [23]:
#keep every col and add the expression as a new col named tip_pct
df.select("*", col.alias('tip_pct')).show(5)

+----------+----+------+------+---+------+----+-------------------+
|total_bill| tip|   sex|smoker|day|  time|size|            tip_pct|
+----------+----+------+------+---+------+----+-------------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|0.05944673337257211|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|0.16054158607350097|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|0.16658733936220846|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2| 0.1397804054054054|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|0.14680764538430255|
+----------+----+------+------+---+------+----+-------------------+
only showing top 5 rows



# Spark API Mini Exercises
Copy the code below to create a pandas dataframe with 20 rows and 3 columns:

`import pandas as pd
import numpy as np`

`np.random.seed(13)`

`pandas_dataframe = pd.DataFrame({
    "n": np.random.randn(20),
    "group": np.random.choice(list("xyz"), 20),
    "abool": np.random.choice([True, False], 20),
})`

### 1. Spark Dataframe Basics

#### i. Use the starter code above to create a pandas dataframe.

#### ii. Convert the pandas dataframe to a spark dataframe. From this point forward, do all of your work with the spark dataframe, not the pandas dataframe.

#### iii. Show the first 3 rows of the dataframe.

#### iv. Show the first 7 rows of the dataframe.

#### v. View a summary of the data using `.describe`.

#### vi. Use `.select` to create a new dataframe with just the `n` and `abool` columns. View the first 5 rows of this dataframe.

#### vii. Use `.select` to create a new dataframe with just the `group` and `abool` columns. View the first 5 rows of this dataframe.

#### viii. Use `.select` to create a new dataframe with the `group` column and the `abool` column renamed to `a_boolean_value`. Show the first 3 rows of this dataframe.

#### ix. Use `.select` to create a new dataframe with the `group` column and the `n` column renamed to `a_numeric_value`. Show the first 6 rows of this dataframe.

### 2. Column Manipulation

#### i. Use the starter code above to re-create a spark dataframe. Store the spark dataframe in a variable named `df`. 

#### ii. Use `.select` to add 4 to the `n` column. Show the results.

#### iii. Subtract 5 from the `n` column and view the results.

#### iv. Multiply the `n` column by 2. View the results along with the original numbers.

#### v. Add a new column named `n2` that is the `n` value multiplied by -1. Show the first 4 rows of your dataframe. You should see the original `n` value as well as `n2`.

#### vi. Add a new column named `n3` that is the `n` value squared. Show the first 5 rows of your dataframe. You should see `n`, `n2`, and `n3`.

#### vii. What happens when you run the code below?

`df.group + df.abool`

#### viii. What happens when you run the code below? What is the difference between this and the previous code sample?

`df.select(df.group + df.abool)`

#### ix. Try adding various other columns together. What are the results of combining the different data types?