[Spark API Roadmap / Cheatsheet](https://zgul.de/teaching/spark-api)

In [1]:
#pyspark library
import pyspark

#create the spark session (needed to create spark dfs)
spark = pyspark.sql.SparkSession.builder.getOrCreate() #once per notebook

In [2]:
#fancy nice html representation
spark

In [3]:
import pandas as pd
import pydataset

## DataFrame Basics

In [4]:
#loading tips dataset from pydataset
tips = pydataset.data('tips')

#spark doesn't load any data until it has to
df = spark.createDataFrame(tips)
df

DataFrame[total_bill: double, tip: double, sex: string, smoker: string, day: string, time: string, size: bigint]

In [5]:
#.show() is an action: it evaluates the df and prints the first 20 rows by default
df.show()

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2|
|     10.27|1.71|  Male|    No|Sun|Dinner|   2|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|
|     15.42|1.57|  Male|    No|Sun|Dinner|   2|
|     18.43| 3.0|  Male|    No|Sun|Dinner|   4|
|     14.83|3.02|Female|    No|Sun|Dinner|   2|
|     21.58|3.92|  Male|    No|Sun|Dinner|   2|
|     10.33|1.67|Female|    No|Sun|Dinner|   3|
|     16.29|3.71|  Male|    No|Sun|Dinne

In [6]:
#pass a number in
df.show(10)

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2|
+----------+----+------+------+---+------+----+
only showing top 10 rows



In [7]:
#common mistake
df2 = df.show(10)

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2|
+----------+----+------+------+---+------+----+
only showing top 10 rows



In [8]:
#reassigning gives back nothing
df2

In [9]:
#.show() is like a print statement, it doesn't return anything
# Don't reassign to a variable
#.show() is just to view contents
type(df2)

NoneType
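
The same thing happens with Python's built-in `print`, which makes for a quick sanity check without Spark. This is only a plain-Python analogy; the `rows` list is made-up stand-in data, not a Spark DataFrame.

```python
# print() displays its argument and returns None, much like DataFrame.show()
result = print("16.99 | 1.01 | Female")
print(type(result).__name__)  # NoneType

# keep "build the object" and "display it" as two separate steps
rows = [(16.99, 1.01), (10.34, 1.66), (21.01, 3.5)]  # stand-in data
first_two = rows[:2]   # returns a new object we can keep using
print(first_two)       # display only; nothing useful to assign
```

The Spark equivalent is to assign the result of a transformation such as `.select(...)` and call `.show()` on it as a separate step.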

In [10]:
#.head(n) returns a list of Row objects, so we can work with the values in python
df.head(5)

[Row(total_bill=16.99, tip=1.01, sex='Female', smoker='No', day='Sun', time='Dinner', size=2),
 Row(total_bill=10.34, tip=1.66, sex='Male', smoker='No', day='Sun', time='Dinner', size=3),
 Row(total_bill=21.01, tip=3.5, sex='Male', smoker='No', day='Sun', time='Dinner', size=3),
 Row(total_bill=23.68, tip=3.31, sex='Male', smoker='No', day='Sun', time='Dinner', size=2),
 Row(total_bill=24.59, tip=3.61, sex='Female', smoker='No', day='Sun', time='Dinner', size=4)]

In [11]:
#show row contents
df.head(5)[0]

Row(total_bill=16.99, tip=1.01, sex='Female', smoker='No', day='Sun', time='Dinner', size=2)

In [12]:
#show value for tip
df.head(5)[0].tip

1.01

In [13]:
#pull specific columns w/ .select and pass col names you want
df.select('total_bill', 'tip', 'size', 'day')
#this specifies a transformation, not an action, so no data is shown until we call .show()

DataFrame[total_bill: double, tip: double, size: bigint, day: string]

In [14]:
#show values
df.select('total_bill', 'tip', 'size', 'day').show()

+----------+----+----+---+
|total_bill| tip|size|day|
+----------+----+----+---+
|     16.99|1.01|   2|Sun|
|     10.34|1.66|   3|Sun|
|     21.01| 3.5|   3|Sun|
|     23.68|3.31|   2|Sun|
|     24.59|3.61|   4|Sun|
|     25.29|4.71|   4|Sun|
|      8.77| 2.0|   2|Sun|
|     26.88|3.12|   4|Sun|
|     15.04|1.96|   2|Sun|
|     14.78|3.23|   2|Sun|
|     10.27|1.71|   2|Sun|
|     35.26| 5.0|   4|Sun|
|     15.42|1.57|   2|Sun|
|     18.43| 3.0|   4|Sun|
|     14.83|3.02|   2|Sun|
|     21.58|3.92|   2|Sun|
|     10.33|1.67|   3|Sun|
|     16.29|3.71|   3|Sun|
|     16.97| 3.5|   3|Sun|
|     20.65|3.35|   3|Sat|
+----------+----+----+---+
only showing top 20 rows



In [15]:
#like sql, get back every col
df.select('*')

DataFrame[total_bill: double, tip: double, sex: string, smoker: string, day: string, time: string, size: bigint]

In [16]:
#reference col name (spark col object), but no data
df.tip

Column<'tip'>

In [17]:
#tip percentage
df.tip / df.total_bill
#syntax looks like pandas, but result is different
#the column represents the transformation, but isn't full of values

Column<'(tip / total_bill)'>

In [18]:
#apply the expression w/ .select() and view the results w/ .show()
df.select(df.tip / df.total_bill).show(5)

+-------------------+
| (tip / total_bill)|
+-------------------+
|0.05944673337257211|
|0.16054158607350097|
|0.16658733936220846|
| 0.1397804054054054|
|0.14680764538430255|
+-------------------+
only showing top 5 rows



In [19]:
#take expressions and store in a new col
col = df.tip / df.total_bill
col
#holds expression that represents new col, but no data

Column<'(tip / total_bill)'>

In [20]:
#see what we end up with and make modifications
df.select(col).show(5)
#by default, name of col, is the operation that was performed
#so add alias (transforms column)

+-------------------+
| (tip / total_bill)|
+-------------------+
|0.05944673337257211|
|0.16054158607350097|
|0.16658733936220846|
| 0.1397804054054054|
|0.14680764538430255|
+-------------------+
only showing top 5 rows



In [21]:
#using .alias to rename col
df.select(col.alias('tip_pct')).show(5)

+-------------------+
|            tip_pct|
+-------------------+
|0.05944673337257211|
|0.16054158607350097|
|0.16658733936220846|
| 0.1397804054054054|
|0.14680764538430255|
+-------------------+
only showing top 5 rows



In [22]:
#add new col to df by doing .select('*', new_col.alias('new_col_name'))
df.select("*", col.alias('tip_pct')).show(5)

+----------+----+------+------+---+------+----+-------------------+
|total_bill| tip|   sex|smoker|day|  time|size|            tip_pct|
+----------+----+------+------+---+------+----+-------------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|0.05944673337257211|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|0.16054158607350097|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|0.16658733936220846|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2| 0.1397804054054054|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|0.14680764538430255|
+----------+----+------+------+---+------+----+-------------------+
only showing top 5 rows



In [23]:
#the df is not modified b/c it was not reassigned
df.show(2)

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
+----------+----+------+------+---+------+----+
only showing top 2 rows



In [24]:
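#to get a df w/ tip percentage, assign the result of .select() to a variable
#then call .show() separately (.show() returns None, so don't assign its result)
df_with_tip_pct = df.select("*", col.alias('tip_pct'))
df_with_tip_pct.show(5)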

+----------+----+------+------+---+------+----+-------------------+
|total_bill| tip|   sex|smoker|day|  time|size|            tip_pct|
+----------+----+------+------+---+------+----+-------------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|0.05944673337257211|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|0.16054158607350097|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|0.16658733936220846|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2| 0.1397804054054054|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|0.14680764538430255|
+----------+----+------+------+---+------+----+-------------------+
only showing top 5 rows



## Transforming Columns

In [25]:
#note: importing sum this way shadows python's built-in sum in this notebook
from pyspark.sql.functions import sum, mean, concat

#### Built-In Functions

In [26]:
#use built in functions to apply them to cols
df.select(mean(df.tip), sum(df.total_bill)).show()

+----------------+-----------------+
|        avg(tip)|  sum(total_bill)|
+----------------+-----------------+
|2.99827868852459|4827.769999999999|
+----------------+-----------------+



In [27]:
#concatenate text from multiple strings together
df.select(concat('day', 'time')).show(5)

+-----------------+
|concat(day, time)|
+-----------------+
|        SunDinner|
|        SunDinner|
|        SunDinner|
|        SunDinner|
|        SunDinner|
+-----------------+
only showing top 5 rows



In [28]:
#above isn't nicely formatted, so we need a space between day and time
# df.select(concat('day', ' ' 'time')).show(5)
#passing a plain space string fails:
#concat treats every string argument as the name of a column in the df
#(python also joins the adjacent literals ' ' 'time' into ' time', the name in the error below)

#Error: 
#AnalysisException: cannot resolve '` time`' given input columns: [day, sex, size, smoker, time, tip, total_bill];
#'Project [concat(day#4, ' time) AS concat(day,  time)#295]
#+- LogicalRDD [total_bill#0, tip#1, sex#2, smoker#3, day#4, time#5, size#6L], false
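
The `' time'` in the error message comes from Python itself: adjacent string literals with no comma between them are joined before `concat` ever sees them. A minimal plain-Python sketch of that behaviour (no Spark needed; the tuple names are illustrative):

```python
# adjacent string literals are concatenated by the Python parser
joined = ' ' 'time'
print(repr(joined))  # ' time'

# so the commented-out call above passes two arguments, not three
args_without_comma = ('day', ' ' 'time')
args_with_comma = ('day', ' ', 'time')
print(len(args_without_comma), len(args_with_comma))  # 2 3
```

Adding the missing comma would not be enough on its own, because `concat` would then try to resolve `' '` as a column name.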

In [29]:
#for literal (lit) function, needs import
from pyspark.sql.functions import lit

In [30]:
#use lit() to pass a literal space character instead of a column name
df.select(concat('day', lit(' '), 'time')).show(5)

+--------------------+
|concat(day,  , time)|
+--------------------+
|          Sun Dinner|
|          Sun Dinner|
|          Sun Dinner|
|          Sun Dinner|
|          Sun Dinner|
+--------------------+
only showing top 5 rows



#### Type Casting

In [31]:
#cast a column to another data type w/ .cast()
df.tip.cast('string')
#gives back col object that represents the transformation specified

Column<'CAST(tip AS STRING)'>

In [32]:
#apply to df w/ .select()
df.select(df.tip.cast('string'))

DataFrame[tip: string]

In [33]:
#view results w/ .show()
df.select(df.tip.cast('string')).show()

+----+
| tip|
+----+
|1.01|
|1.66|
| 3.5|
|3.31|
|3.61|
|4.71|
| 2.0|
|3.12|
|1.96|
|3.23|
|1.71|
| 5.0|
|1.57|
| 3.0|
|3.02|
|3.92|
|1.67|
|3.71|
| 3.5|
|3.35|
+----+
only showing top 20 rows



In [34]:
#now cast a non-numeric string column to integer: gives back nulls
df.select(df.time.cast('int')).show()
#nulls are like numpy.nan or python None: they indicate the absence of a value
#there's no error like we're used to in python; spark just returns null
#(this is the behavior with ANSI mode off; with ANSI mode on, an invalid cast raises an error)

+----+
|time|
+----+
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
+----+
only showing top 20 rows



#### String Manipulation

In [35]:
#string manipulation w/ regex
#apply to every value in a col
from pyspark.sql.functions import regexp_extract, regexp_replace

In [36]:
#extract pieces of a col (the first thing from time col)
df.select(
    regexp_extract('time', r'(\w).*', 1)
).show(5)                   #first alphanumeric char, 1 to get contents of first capture group


+-------------------------------+
|regexp_extract(time, (\w).*, 1)|
+-------------------------------+
|                              D|
|                              D|
|                              D|
|                              D|
|                              D|
+-------------------------------+
only showing top 5 rows



In [37]:
#add alias
df.select(
    regexp_extract('time', r'(\w).*', 1).alias('first_letter')
).show(5)   

+------------+
|first_letter|
+------------+
|           D|
|           D|
|           D|
|           D|
|           D|
+------------+
only showing top 5 rows



In [38]:
#add in original time col
df.select(
    'time',
    regexp_extract('time', r'(\w).*', 1).alias('first_letter')
).show(5)   

+------+------------+
|  time|first_letter|
+------+------------+
|Dinner|           D|
|Dinner|           D|
|Dinner|           D|
|Dinner|           D|
|Dinner|           D|
+------+------------+
only showing top 5 rows



In [39]:
#use regexp_replace (like re.sub or .str.replace in pandas)
df.select(
    'time',
    regexp_extract('time', r'(\w).*', 1).alias('first_letter'),
    regexp_replace('time', r'[aeiou]', 'X')
).show(5)                #replace any of the vowels w/ a capital X

+------+------------+-----------------------------------+
|  time|first_letter|regexp_replace(time, [aeiou], X, 1)|
+------+------------+-----------------------------------+
|Dinner|           D|                             DXnnXr|
|Dinner|           D|                             DXnnXr|
|Dinner|           D|                             DXnnXr|
|Dinner|           D|                             DXnnXr|
|Dinner|           D|                             DXnnXr|
+------+------------+-----------------------------------+
only showing top 5 rows
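
For comparison, the same two operations on a single value with Python's `re` module. This is a plain-Python sketch, not Spark code; Spark uses Java regular expressions, which as far as these simple patterns go should behave the same way.

```python
import re

value = "Dinner"

# capture group 1 holds the first word character, like regexp_extract(..., 1)
first_letter = re.match(r"(\w).*", value).group(1)

# replace every lowercase vowel with X, like regexp_replace
replaced = re.sub(r"[aeiou]", "X", value)

print(first_letter, replaced)  # D DXnnXr
```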



#### when and otherwise

In [40]:
#like case... when in sql
#or like np.where (when it meets this condition, then do one thing)
#or like if function in excel
from pyspark.sql.functions import when

In [41]:
#modify df so that it now contains tip_pct col
df = df.select(
    '*',
    (df.tip / df.total_bill).alias('tip_pct')
)
#adds tip percentage column
df 

DataFrame[total_bill: double, tip: double, sex: string, smoker: string, day: string, time: string, size: bigint, tip_pct: double]

In [42]:
#now select everything 
#and when the tip_pct is greater than .2, then say 'Good tip'
#otherwise, say 'not good tip'
df.select(
    '*', 
    when(df.tip_pct > .2, 'Good Tip').otherwise('Not Good Tip')
).show(5)

+----------+----+------+------+---+------+----+-------------------+-------------------------------------------------------------+
|total_bill| tip|   sex|smoker|day|  time|size|            tip_pct|CASE WHEN (tip_pct > 0.2) THEN Good Tip ELSE Not Good Tip END|
+----------+----+------+------+---+------+----+-------------------+-------------------------------------------------------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|0.05944673337257211|                                                 Not Good Tip|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|0.16054158607350097|                                                 Not Good Tip|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|0.16658733936220846|                                                 Not Good Tip|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2| 0.1397804054054054|                                                 Not Good Tip|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|0.14680764538430255|                       

In [43]:
#to clean up above output, add alias to when/otherwise statement
#and don't select everything, just tip_pct col instead of everything
df.select(
    'tip_pct', 
    (when(df.tip_pct > .2, 'Good Tip').otherwise('Not Good Tip')).alias('tip_description')
).show(10)

+-------------------+---------------+
|            tip_pct|tip_description|
+-------------------+---------------+
|0.05944673337257211|   Not Good Tip|
|0.16054158607350097|   Not Good Tip|
|0.16658733936220846|   Not Good Tip|
| 0.1397804054054054|   Not Good Tip|
|0.14680764538430255|   Not Good Tip|
|0.18623962040332148|   Not Good Tip|
|0.22805017103762829|       Good Tip|
|0.11607142857142858|   Not Good Tip|
|0.13031914893617022|   Not Good Tip|
| 0.2185385656292287|       Good Tip|
+-------------------+---------------+
only showing top 10 rows



# Spark API Mini Exercises
Copy the code below to create a pandas dataframe with 20 rows and 3 columns:

`import pandas as pd
import numpy as np`

`np.random.seed(13)`

`pandas_dataframe = pd.DataFrame({
    "n": np.random.randn(20),
    "group": np.random.choice(list("xyz"), 20),
    "abool": np.random.choice([True, False], 20),
})`

In [44]:
import pandas as pd
import numpy as np

np.random.seed(13)

### 1. Spark Dataframe Basics

#### i. Use the starter code above to create a pandas dataframe.

In [45]:
pandas_dataframe = pd.DataFrame({
    "n": np.random.randn(20),
    "group": np.random.choice(list("xyz"), 20),
    "abool": np.random.choice([True, False], 20),
})

#### ii. Convert the pandas dataframe to a spark dataframe. From this point forward, do all of your work with the spark dataframe, not the pandas dataframe.

In [46]:
df = spark.createDataFrame(pandas_dataframe)
df

DataFrame[n: double, group: string, abool: boolean]

In [47]:
#can see summary stats w/ .describe()
df.describe().show()

+-------+-------------------+-----+
|summary|                  n|group|
+-------+-------------------+-----+
|  count|                 20|   20|
|   mean|0.36640264498852165| null|
| stddev| 0.8905322898155364| null|
|    min| -1.261605945319069|    x|
|    max| 2.1503829673811126|    z|
+-------+-------------------+-----+



#### iii. Show the first 3 rows of the dataframe.

In [48]:
df.show(3)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
+--------------------+-----+-----+
only showing top 3 rows



#### iv. Show the first 7 rows of the dataframe.

In [49]:
df.show(7)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
+--------------------+-----+-----+
only showing top 7 rows



#### v. View a summary of the data using `.describe`.

In [50]:
df.describe().show()

+-------+-------------------+-----+
|summary|                  n|group|
+-------+-------------------+-----+
|  count|                 20|   20|
|   mean|0.36640264498852165| null|
| stddev| 0.8905322898155364| null|
|    min| -1.261605945319069|    x|
|    max| 2.1503829673811126|    z|
+-------+-------------------+-----+



#### vi. Use `.select` to create a new dataframe with just the `n` and `abool` columns. View the first 5 rows of this dataframe.

In [51]:
df.select('n', 'abool').show(5)

+--------------------+-----+
|                   n|abool|
+--------------------+-----+
|  -0.712390662050588|false|
|   0.753766378659703|false|
|-0.04450307833805...|false|
| 0.45181233874578974|false|
|  1.3451017084510097|false|
+--------------------+-----+
only showing top 5 rows



#### vii. Use `.select` to create a new dataframe with just the `group` and `abool` columns. View the first 5 rows of this dataframe.

In [52]:
df.select('group', 'abool').show(5)

+-----+-----+
|group|abool|
+-----+-----+
|    z|false|
|    x|false|
|    z|false|
|    y|false|
|    z|false|
+-----+-----+
only showing top 5 rows



#### viii. Use `.select` to create a new dataframe with the `group` column and the `abool` column renamed to `a_boolean_value`. Show the first 3 rows of this dataframe.

In [53]:
df.select('group', df.abool.alias('a_boolean_value')).show(3)

+-----+---------------+
|group|a_boolean_value|
+-----+---------------+
|    z|          false|
|    x|          false|
|    z|          false|
+-----+---------------+
only showing top 3 rows



#### ix. Use `.select` to create a new dataframe with the `group` column and the `n` column renamed to `a_numeric_value`. Show the first 6 rows of this dataframe.

In [54]:
df.select('group', df.n.alias('a_numeric_value')).show(6)

+-----+--------------------+
|group|     a_numeric_value|
+-----+--------------------+
|    z|  -0.712390662050588|
|    x|   0.753766378659703|
|    z|-0.04450307833805...|
|    y| 0.45181233874578974|
|    z|  1.3451017084510097|
|    y|  0.5323378882945463|
+-----+--------------------+
only showing top 6 rows



In [55]:
#nothing has been reassigned so df is still like original
#none of changes above were saved to a new variable
df.show(5)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
+--------------------+-----+-----+
only showing top 5 rows



### 2. Column Manipulation

#### i. Use the starter code above to re-create a spark dataframe. Store the spark dataframe in a variable named `df`. 

In [56]:
df.show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
|  0.8612113741693206|    x|false|
|  1.4786857374358966|    z| true|
| -1.0453771305385342|    y| true|
| -0.7889890249515489|    x|false|
|  -1.261605945319069|    y|false|
|  0.5628467852810314|    y| true|
|-0.24332625188556253|    y| true|
|  0.9137407048596775|    y|false|
| 0.31735092273633597|    x|false|
| 0.12730328020698067|    z|false|
|  2.1503829673811126|    y| true|
|  0.6062886568962988|    x|false|
|-0.02677164998644...|    x| true|
+--------------------+-----+-----+



#### ii. Use `.select` to add 4 to the `n` column. Show the results.

In [57]:
df.select(df.n + 4).show()

+------------------+
|           (n + 4)|
+------------------+
|3.2876093379494122|
| 4.753766378659703|
|3.9554969216619464|
|  4.45181233874579|
|5.3451017084510095|
| 4.532337888294546|
| 5.350187899722527|
|  4.86121137416932|
| 5.478685737435897|
| 2.954622869461466|
|3.2110109750484512|
| 2.738394054680931|
| 4.562846785281032|
|3.7566737481144377|
| 4.913740704859677|
| 4.317350922736336|
| 4.127303280206981|
| 6.150382967381113|
| 4.606288656896298|
|3.9732283500135592|
+------------------+



#### iii. Subtract 5 from the `n` column and view the results.

In [58]:
df.select(df.n - 5).show()

+-------------------+
|            (n - 5)|
+-------------------+
| -5.712390662050588|
| -4.246233621340297|
| -5.044503078338053|
|  -4.54818766125421|
|-3.6548982915489905|
| -4.467662111705454|
|-3.6498121002774733|
|  -4.13878862583068|
| -3.521314262564103|
| -6.045377130538534|
| -5.788989024951549|
| -6.261605945319069|
| -4.437153214718968|
| -5.243326251885563|
| -4.086259295140323|
| -4.682649077263664|
| -4.872696719793019|
|-2.8496170326188874|
| -4.393711343103702|
| -5.026771649986441|
+-------------------+



#### iv. Multiply the `n` column by 2. View the results along with the original numbers.

In [59]:
df.select('n', df.n * 2).show()

+--------------------+--------------------+
|                   n|             (n * 2)|
+--------------------+--------------------+
|  -0.712390662050588|  -1.424781324101176|
|   0.753766378659703|   1.507532757319406|
|-0.04450307833805...|-0.08900615667610691|
| 0.45181233874578974|  0.9036246774915795|
|  1.3451017084510097|  2.6902034169020195|
|  0.5323378882945463|  1.0646757765890926|
|  1.3501878997225267|  2.7003757994450535|
|  0.8612113741693206|  1.7224227483386412|
|  1.4786857374358966|   2.957371474871793|
| -1.0453771305385342| -2.0907542610770684|
| -0.7889890249515489| -1.5779780499030978|
|  -1.261605945319069|  -2.523211890638138|
|  0.5628467852810314|  1.1256935705620628|
|-0.24332625188556253|-0.48665250377112507|
|  0.9137407048596775|   1.827481409719355|
| 0.31735092273633597|  0.6347018454726719|
| 0.12730328020698067| 0.25460656041396135|
|  2.1503829673811126|   4.300765934762225|
|  0.6062886568962988|  1.2125773137925977|
|-0.02677164998644...|-0.0535432

#### v. Add a new column named `n2` that is the `n` value multiplied by -1. Show the first 4 rows of your dataframe. You should see the original `n` value as well as `n2`.

In [60]:
#needs reassigning to modify df
n2 = (df.n * -1).alias('n2')

df = df.select('*', n2)
df.show(4)

+--------------------+-----+-----+--------------------+
|                   n|group|abool|                  n2|
+--------------------+-----+-----+--------------------+
|  -0.712390662050588|    z|false|   0.712390662050588|
|   0.753766378659703|    x|false|  -0.753766378659703|
|-0.04450307833805...|    z|false|0.044503078338053455|
| 0.45181233874578974|    y|false|-0.45181233874578974|
+--------------------+-----+-----+--------------------+
only showing top 4 rows



#### vi. Add a new column named `n3` that is the `n` value squared. Show the first 5 rows of your dataframe. You should see `n`, `n2`, and `n3`.

In [61]:
n3 = (df.n ** 2).alias('n3')

df = df.select('*', n3)
df.show(5)

+--------------------+-----+-----+--------------------+--------------------+
|                   n|group|abool|                  n2|                  n3|
+--------------------+-----+-----+--------------------+--------------------+
|  -0.712390662050588|    z|false|   0.712390662050588|   0.507500455376875|
|   0.753766378659703|    x|false|  -0.753766378659703|  0.5681637535977627|
|-0.04450307833805...|    z|false|0.044503078338053455|0.001980523981562...|
| 0.45181233874578974|    y|false|-0.45181233874578974| 0.20413438944294027|
|  1.3451017084510097|    z|false| -1.3451017084510097|  1.8092986060778251|
+--------------------+-----+-----+--------------------+--------------------+
only showing top 5 rows



#### vii. What happens when you run the code below?

`df.group + df.abool`

In [62]:
#add a string col to a boolean col
#this only builds a column expression; nothing is checked against the df yet
df.group + df.abool

#no error here b/c spark hasn't analyzed the expression yet (that happens in .select, see cell below)

Column<'(group + abool)'>

#### viii. What happens when you run the code below? What is the difference between this and the previous code sample?

`df.select(df.group + df.abool)`

In [63]:
#df.select(df.group + df.abool)

#error message:
#AnalysisException: cannot resolve '(CAST(`group` AS DOUBLE) + `abool`)' due to data type mismatch: differing types in '(CAST(`group` AS DOUBLE) + `abool`)' (double and boolean).;
#'Project [(cast(group#431 as double) + abool#432) AS (group + abool)#843]
#+- Project [n#430, group#431, abool#432, n2#794, POWER(n#430, cast(2 as double)) AS n3#816]
#   +- Project [n#430, group#431, abool#432, (n#430 * cast(-1 as double)) AS n2#794]
#      +- LogicalRDD [n#430, group#431, abool#432], false

#### ix. Try adding various other columns together. What are the results of combining the different data types?

In [64]:
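#combining numeric columns works without error: 2*n + 2*n2 + 2*n3
n4 = (df.n * 2 + df.n2 * 2 + df.n3 * 2).alias('n4')

#assign the result of .select() first, then call .show() (.show() returns None)
df = df.select('*', n4)
df.show(5)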

+--------------------+-----+-----+--------------------+--------------------+--------------------+
|                   n|group|abool|                  n2|                  n3|                  n4|
+--------------------+-----+-----+--------------------+--------------------+--------------------+
|  -0.712390662050588|    z|false|   0.712390662050588|   0.507500455376875|    1.01500091075375|
|   0.753766378659703|    x|false|  -0.753766378659703|  0.5681637535977627|  1.1363275071955254|
|-0.04450307833805...|    z|false|0.044503078338053455|0.001980523981562...|0.003961047963125845|
| 0.45181233874578974|    y|false|-0.45181233874578974| 0.20413438944294027| 0.40826877888588053|
|  1.3451017084510097|    z|false| -1.3451017084510097|  1.8092986060778251|  3.6185972121556502|
+--------------------+-----+-----+--------------------+--------------------+--------------------+
only showing top 5 rows



### 3. Type casting

#### i. Use the starter code above to re-create a spark dataframe.

#### ii. Use `.printSchema` to view the datatypes in your dataframe.

#### iii. Use `.dtypes` to view the datatypes in your dataframe.

#### iv. What is the difference between the two code samples below?

`df.abool.cast('int')`


`df.select(df.abool.cast('int')).show()`

#### v. Use `.select` and `.cast` to convert the `abool` column to an integer type. View the results.

#### vi. Convert the `group` column to an integer data type and view the results. What happens?

#### vii. Convert the `n` column to an integer data type and view the results. What happens?

#### viii. Convert the `abool` column to a string data type and view the results. What happens?

### 4. Built-in Functions

#### i. Use the starter code above to re-create a spark dataframe.

#### ii. Import the necessary functions from `pyspark.sql.functions`

#### iii. Find the highest `n` value.

#### iv. Find the lowest `n` value.

#### v. Find the average `n` value.

#### vi. Use `concat` to change the `group` column to say, e.g. "Group: x" or "Group: y"

#### vii. Use `concat` to combine the `n` and `group` columns to produce results that look like this: "x: -1.432" or "z: 2.352"

### 5. When / Otherwise

#### i. Use the starter code above to re-create a spark dataframe.

#### ii. Use `when` and `.otherwise` to create a column that contains the text "It is true" when `abool` is true and "It is false" when `abool` is false.

#### iii. Create a column that contains 0 if n is less than 0, otherwise, the original n value.

### 6. Filter / Where

#### i. Use the starter code above to re-create a spark dataframe.

#### ii. Use `.filter` or `.where` to select just the rows where the group is `y` and view the results.

#### iii. Select just the rows where the `abool` column is false and view the results.

#### iv. Find the rows where the `group` column is *not* `y`.

#### v. Find the rows where `n` is positive.

#### vi. Find the rows where `abool` is true and the `group` column is `z`.

#### vii. Find the rows where `abool` is true or the `group` column is `z`.

#### viii. Find the rows where `abool` is false and `n` is less than 1.

#### ix. Find the rows where `abool` is false or `n` is less than 1.

### 7. Sorting

#### i. Use the starter code above to re-create a spark dataframe.

#### ii. Sort by the `n` value.

#### iii. Sort by the `group` value, both ascending and descending.

#### iv. Sort by the group value first, then, within each group, sort by `n` value.

#### v. Sort by `abool`, `group`, and `n`. Does it matter in what order you specify the columns when sorting?