## 1.) Spark Dataframe Basics

In [9]:
import pandas as pd
import numpy as np
import pyspark

In [7]:
np.random.seed(13)

pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

Convert the pandas dataframe to a spark dataframe. From this point forward, do all of your work with the spark dataframe, not the pandas dataframe.

In [12]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pandas_dataframe)
df.show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
|  0.8612113741693206|    x|false|
|  1.4786857374358966|    z| true|
| -1.0453771305385342|    y| true|
| -0.7889890249515489|    x|false|
|  -1.261605945319069|    y|false|
|  0.5628467852810314|    y| true|
|-0.24332625188556253|    y| true|
|  0.9137407048596775|    y|false|
| 0.31735092273633597|    x|false|
| 0.12730328020698067|    z|false|
|  2.1503829673811126|    y| true|
|  0.6062886568962988|    x|false|
|-0.02677164998644...|    x| true|
+--------------------+-----+-----+



Show the first 3 rows of the dataframe.

In [14]:
df.show(3)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
+--------------------+-----+-----+
only showing top 3 rows



Show the first 7 rows of the dataframe.

In [15]:
df.show(7)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
+--------------------+-----+-----+
only showing top 7 rows



View a summary of the data using .describe

In [18]:
df.describe().show()

+-------+-------------------+-----+
|summary|                  n|group|
+-------+-------------------+-----+
|  count|                 20|   20|
|   mean|0.36640264498852165| null|
| stddev| 0.8905322898155364| null|
|    min| -1.261605945319069|    x|
|    max| 2.1503829673811126|    z|
+-------+-------------------+-----+



Use .select to create a new dataframe with just the n and abool columns. View the first 5 rows of this dataframe.

In [26]:
df.select('n','abool').show(5)

+--------------------+-----+
|                   n|abool|
+--------------------+-----+
|  -0.712390662050588|false|
|   0.753766378659703|false|
|-0.04450307833805...|false|
| 0.45181233874578974|false|
|  1.3451017084510097|false|
+--------------------+-----+
only showing top 5 rows



Use .select to create a new dataframe with just the group and abool columns. View the first 5 rows of this dataframe.

In [24]:
df.select('group','abool').show(5)

+-----+-----+
|group|abool|
+-----+-----+
|    z|false|
|    x|false|
|    z|false|
|    y|false|
|    z|false|
+-----+-----+
only showing top 5 rows



Use .select to create a new dataframe with the group column and the abool column renamed to a_boolean_value. Show the first 3 rows of this dataframe.

In [28]:
df.select('group', df.abool.alias('a_boolean_value')).show(3)

+-----+---------------+
|group|a_boolean_value|
+-----+---------------+
|    z|          false|
|    x|          false|
|    z|          false|
+-----+---------------+
only showing top 3 rows



Use .select to create a new dataframe with the group column and the n column renamed to a_numeric_value. Show the first 6 rows of this dataframe.

In [29]:
df.select('group',df.n.alias('a_numeric_value')).show(6)

+-----+--------------------+
|group|     a_numeric_value|
+-----+--------------------+
|    z|  -0.712390662050588|
|    x|   0.753766378659703|
|    z|-0.04450307833805...|
|    y| 0.45181233874578974|
|    z|  1.3451017084510097|
|    y|  0.5323378882945463|
+-----+--------------------+
only showing top 6 rows



## 2.) Column Manipulation

Use the starter code above to re-create a spark dataframe. Store the spark dataframe in a variable named df.

In [30]:
df = spark.createDataFrame(pandas_dataframe)

Use .select to add 4 to the n column. Show the results.

In [35]:
df.select(df.n + 4).show()

+------------------+
|           (n + 4)|
+------------------+
|3.2876093379494122|
| 4.753766378659703|
|3.9554969216619464|
|  4.45181233874579|
|5.3451017084510095|
| 4.532337888294546|
| 5.350187899722527|
|  4.86121137416932|
| 5.478685737435897|
| 2.954622869461466|
|3.2110109750484512|
| 2.738394054680931|
| 4.562846785281032|
|3.7566737481144377|
| 4.913740704859677|
| 4.317350922736336|
| 4.127303280206981|
| 6.150382967381113|
| 4.606288656896298|
|3.9732283500135592|
+------------------+



Subtract 5 from the n column and view the results.

In [36]:
df.select(df.n - 5).show()

+-------------------+
|            (n - 5)|
+-------------------+
| -5.712390662050588|
| -4.246233621340297|
| -5.044503078338053|
|  -4.54818766125421|
|-3.6548982915489905|
| -4.467662111705454|
|-3.6498121002774733|
|  -4.13878862583068|
| -3.521314262564103|
| -6.045377130538534|
| -5.788989024951549|
| -6.261605945319069|
| -4.437153214718968|
| -5.243326251885563|
| -4.086259295140323|
| -4.682649077263664|
| -4.872696719793019|
|-2.8496170326188874|
| -4.393711343103702|
| -5.026771649986441|
+-------------------+



Multiply the n column by 2. View the results along with the original numbers.

In [38]:
df.select(df.n * 2, df.n).show()

+--------------------+--------------------+
|             (n * 2)|                   n|
+--------------------+--------------------+
|  -1.424781324101176|  -0.712390662050588|
|   1.507532757319406|   0.753766378659703|
|-0.08900615667610691|-0.04450307833805...|
|  0.9036246774915795| 0.45181233874578974|
|  2.6902034169020195|  1.3451017084510097|
|  1.0646757765890926|  0.5323378882945463|
|  2.7003757994450535|  1.3501878997225267|
|  1.7224227483386412|  0.8612113741693206|
|   2.957371474871793|  1.4786857374358966|
| -2.0907542610770684| -1.0453771305385342|
| -1.5779780499030978| -0.7889890249515489|
|  -2.523211890638138|  -1.261605945319069|
|  1.1256935705620628|  0.5628467852810314|
|-0.48665250377112507|-0.24332625188556253|
|   1.827481409719355|  0.9137407048596775|
|  0.6347018454726719| 0.31735092273633597|
| 0.25460656041396135| 0.12730328020698067|
|   4.300765934762225|  2.1503829673811126|
|  1.2125773137925977|  0.6062886568962988|
|-0.05354329997288145|-0.02677164998644...|
+--------------------+--------------------+

Add a new column named n2 that is the n value multiplied by -1. Show the first 4 rows of your dataframe. You should see the original n value as well as n2.

In [57]:
n = df.select(df.n, (df.n * -1).alias('n2'))
n.show(4)

+--------------------+--------------------+
|                   n|                  n2|
+--------------------+--------------------+
|  -0.712390662050588|   0.712390662050588|
|   0.753766378659703|  -0.753766378659703|
|-0.04450307833805...|0.044503078338053455|
| 0.45181233874578974|-0.45181233874578974|
+--------------------+--------------------+
only showing top 4 rows



Add a new column named n3 that is the n value squared. Show the first 5 rows of your dataframe. You should see n, n2, and n3.

In [58]:
n.select('*', (n.n ** 2).alias('n3')).show(5)

+--------------------+--------------------+--------------------+
|                   n|                  n2|                  n3|
+--------------------+--------------------+--------------------+
|  -0.712390662050588|   0.712390662050588|   0.507500455376875|
|   0.753766378659703|  -0.753766378659703|  0.5681637535977627|
|-0.04450307833805...|0.044503078338053455|0.001980523981562...|
| 0.45181233874578974|-0.45181233874578974| 0.20413438944294027|
|  1.3451017084510097| -1.3451017084510097|  1.8092986060778251|
+--------------------+--------------------+--------------------+
only showing top 5 rows



What happens when you run the code below?

df.group + df.abool

A Column object is created that represents the expression df.group + df.abool. We do not see an error here because defining a column expression does not evaluate anything; Spark has not yet checked it against the data.

What happens when you run the code below? What is the difference between this and the previous code sample?

df.select(df.group + df.abool)

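An error is raised when we select the expression, because .select makes Spark analyze it against the dataframe's schema, and a string column (group) cannot be added to a boolean column (abool). The difference from the previous sample is that there the expression was only defined and never applied to a dataframe.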

## 3.) Spark SQL

Use the starter code above to re-create a spark dataframe.

In [64]:
df = spark.createDataFrame(pandas_dataframe)

Turn your dataframe into a table that can be queried with spark SQL. Name the table my_df. Answer the rest of the questions in this section with a spark sql query (spark.sql) against my_df. After each step, view the first 7 records from the dataframe.

In [65]:
df.createOrReplaceTempView('my_df')

In [70]:
my_df = spark.sql('''SELECT * FROM my_df''')
type(my_df)

pyspark.sql.dataframe.DataFrame

Write a query that shows all of the columns from your dataframe.

In [82]:
my_df.show(7)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
+--------------------+-----+-----+
only showing top 7 rows



Write a query that shows just the n and abool columns from the dataframe.

In [84]:
my_df.select('n','abool').show(7)

+--------------------+-----+
|                   n|abool|
+--------------------+-----+
|  -0.712390662050588|false|
|   0.753766378659703|false|
|-0.04450307833805...|false|
| 0.45181233874578974|false|
|  1.3451017084510097|false|
|  0.5323378882945463|false|
|  1.3501878997225267|false|
+--------------------+-----+
only showing top 7 rows



Write a query that shows just the n and group columns. Rename the group column to g.

In [85]:
my_df.select('n', my_df.group.alias('g')).show(7)

+--------------------+---+
|                   n|  g|
+--------------------+---+
|  -0.712390662050588|  z|
|   0.753766378659703|  x|
|-0.04450307833805...|  z|
| 0.45181233874578974|  y|
|  1.3451017084510097|  z|
|  0.5323378882945463|  y|
|  1.3501878997225267|  z|
+--------------------+---+
only showing top 7 rows



Write a query that selects n, and creates two new columns: n2, the original n values halved, and n3: the original n values minus 1.

In [91]:
n2 = (my_df.n / 2).alias('n2')
n3 = (my_df.n - 1).alias('n3')
my_df.select('n',n2,n3).show(7)

+--------------------+--------------------+--------------------+
|                   n|                  n2|                  n3|
+--------------------+--------------------+--------------------+
|  -0.712390662050588|  -0.356195331025294|  -1.712390662050588|
|   0.753766378659703|  0.3768831893298515|-0.24623362134029703|
|-0.04450307833805...|-0.02225153916902...| -1.0445030783380536|
| 0.45181233874578974| 0.22590616937289487| -0.5481876612542103|
|  1.3451017084510097|  0.6725508542255049| 0.34510170845100974|
|  0.5323378882945463| 0.26616894414727316| -0.4676621117054537|
|  1.3501878997225267|  0.6750939498612634| 0.35018789972252673|
+--------------------+--------------------+--------------------+
only showing top 7 rows



What happens if you make a SQL syntax error in your query?

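A Py4JJavaError is reported (newer PySpark versions surface it as a ParseException). The message explains that the SQL could not be parsed and indicates where in the query the problem is.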

## 4.) Type Casting

Use the starter code above to re-create a spark dataframe.

In [108]:
df = spark.createDataFrame(pandas_dataframe)
df.show(10)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
|  0.8612113741693206|    x|false|
|  1.4786857374358966|    z| true|
| -1.0453771305385342|    y| true|
+--------------------+-----+-----+
only showing top 10 rows



Use .printSchema to view the datatypes in your dataframe.

In [95]:
df.printSchema()

root
 |-- n: double (nullable = true)
 |-- group: string (nullable = true)
 |-- abool: boolean (nullable = true)



Use .dtypes to view the datatypes in your dataframe.

In [96]:
df.dtypes

[('n', 'double'), ('group', 'string'), ('abool', 'boolean')]

What is the difference between the two code samples below?

df.abool.cast('int')


df.select(df.abool.cast('int')).show(3)

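The first line only builds a Column expression that describes casting abool to an integer; nothing is computed and no dataframe is produced. The second line uses that expression in .select, which is a transformation, and then calls .show, an action that runs the computation and prints the first 3 rows.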

Use .select and .cast to convert the abool column to an integer type. View the results.

In [103]:
df.select(df.abool.cast('int')).show(10)

+-----+
|abool|
+-----+
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    1|
|    1|
+-----+
only showing top 10 rows



Convert the group column to an integer data type and view the results. What happens?

This returns null values because the group column contains non-numeric strings that cannot be parsed as integers.

In [112]:
df.select(df.group.cast('int')).show(3)

+-----+
|group|
+-----+
| null|
| null|
| null|
+-----+
only showing top 3 rows



Convert the n column to an integer data type and view the results. What happens?

The n column is converted to the integer data type. The fractional part is dropped, so values are truncated toward zero rather than rounded down: -0.71 becomes 0, not -1.

In [110]:
df.select(df.n.cast('int')).show(5)

+---+
|  n|
+---+
|  0|
|  0|
|  0|
|  0|
|  1|
+---+
only showing top 5 rows



Convert the abool column to a string data type and view the results. What happens?

The abool column is converted to a string without any issues: each boolean becomes the text "true" or "false".

In [106]:
df.select(df.abool.cast('string')).show(3)

+-----+
|abool|
+-----+
|false|
|false|
|false|
+-----+
only showing top 3 rows



## 5.) Built-in Functions

Use the starter code above to re-create a spark dataframe.

In [113]:
df = spark.createDataFrame(pandas_dataframe)
df.show(10)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
|  0.8612113741693206|    x|false|
|  1.4786857374358966|    z| true|
| -1.0453771305385342|    y| true|
+--------------------+-----+-----+
only showing top 10 rows



Import the necessary functions from pyspark.sql.functions

In [115]:
from pyspark.sql.functions import min, max, mean

Find the highest n value.

Find the lowest n value.

Find the average n value.

Use concat to change the group column to say, e.g. "Group: x" or "Group: y"

Use concat to combine the n and group columns to produce results that look like this: "x: -1.432" or "z: 2.352