In [3]:
import pyspark

spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [1]:
import pandas as pd
import numpy as np

np.random.seed(13)

pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

## Spark Dataframe Basics

- Use the starter code above to create a pandas dataframe.
- Convert the pandas dataframe to a spark dataframe. From this point forward, do all of your work with the spark dataframe, not the pandas dataframe.
- Show the first 3 rows of the dataframe.
- Show the first 7 rows of the dataframe.
- View a summary of the data using .describe.
- Use .select to create a new dataframe with just the n and abool columns. View the first 5 rows of this dataframe.
- Use .select to create a new dataframe with just the group and abool columns. View the first 5 rows of this dataframe.
- Use .select to create a new dataframe with the group column and the abool column renamed to a_boolean_value. Show the first 3 rows of this dataframe.
- Use .select to create a new dataframe with the group column and the n column renamed to a_numeric_value. Show the first 6 rows of this dataframe.

In [7]:
df = spark.createDataFrame(pandas_dataframe)

In [8]:
df.show(3) # Show the first 3 rows of the dataframe.
df.show(7) # Show the first 7 rows of the dataframe.

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
+--------------------+-----+-----+
only showing top 3 rows

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
+--------------------+-----+-----+
only showing top 7 rows



In [9]:
df.describe().show()

+-------+-------------------+-----+
|summary|                  n|group|
+-------+-------------------+-----+
|  count|                 20|   20|
|   mean|0.36640264498852165| null|
| stddev| 0.8905322898155364| null|
|    min| -1.261605945319069|    x|
|    max| 2.1503829673811126|    z|
+-------+-------------------+-----+



Use .select to create a new dataframe with just the n and abool columns. View the first 5 rows of this dataframe.

In [12]:
# .show() prints and returns None, so keep the dataframe in its own variable.
n_abool = df.select("n", "abool")
n_abool.show(5)

+--------------------+-----+
|                   n|abool|
+--------------------+-----+
|  -0.712390662050588|false|
|   0.753766378659703|false|
|-0.04450307833805...|false|
| 0.45181233874578974|false|
|  1.3451017084510097|false|
+--------------------+-----+
only showing top 5 rows



Use .select to create a new dataframe with just the group and abool columns. View the first 5 rows of this dataframe.

In [13]:
# .show() prints and returns None, so keep the dataframe in its own variable.
group_abool = df.select("group", "abool")
group_abool.show(5)

+-----+-----+
|group|abool|
+-----+-----+
|    z|false|
|    x|false|
|    z|false|
|    y|false|
|    z|false|
+-----+-----+
only showing top 5 rows



Use .select to create a new dataframe with the group column and the abool column renamed to a_boolean_value. Show the first 3 rows of this dataframe.

In [15]:
a_boolean_value = df.abool.alias("a_boolean_value")

In [16]:
# .show() prints and returns None, so keep the dataframe in its own variable.
group_abv = df.select("group", a_boolean_value)
group_abv.show(3)

+-----+---------------+
|group|a_boolean_value|
+-----+---------------+
|    z|          false|
|    x|          false|
|    z|          false|
+-----+---------------+
only showing top 3 rows



Use .select to create a new dataframe with the group column and the n column renamed to a_numeric_value. Show the first 6 rows of this dataframe.

In [17]:
a_numeric_value = df.n.alias("a_numeric_value")

In [18]:
# .show() prints and returns None, so keep the dataframe in its own variable.
group_anv = df.select("group", a_numeric_value)
group_anv.show(6)

+-----+--------------------+
|group|     a_numeric_value|
+-----+--------------------+
|    z|  -0.712390662050588|
|    x|   0.753766378659703|
|    z|-0.04450307833805...|
|    y| 0.45181233874578974|
|    z|  1.3451017084510097|
|    y|  0.5323378882945463|
+-----+--------------------+
only showing top 6 rows



## Column Manipulation

- Use the starter code above to re-create a spark dataframe. Store the spark dataframe in a variable named df.
- Use .select to add 4 to the n column. Show the results.
- Subtract 5 from the n column and view the results.
- Multiply the n column by 2. View the results along with the original numbers.
- Add a new column named n2 that is the n value multiplied by -1. Show the first 4 rows of your dataframe. You should see the original n value as well as n2.
- Add a new column named n3 that is the n value squared. Show the first 5 rows of your dataframe. You should see n, n2, and n3.
- What happens when you run the code below?

    `df.group + df.abool`
    
- What happens when you run the code below? What is the difference between this and the previous code sample?
    
    `df.select(df.group + df.abool)`
    
- Try adding various other columns together. What are the results of combining the different data types?

In [20]:
df.show(5)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
+--------------------+-----+-----+
only showing top 5 rows



Use .select to add 4 to the n column. Show the results.

In [24]:
n = df.n + 4

In [25]:
df.select(n,"group","abool").show(5)

+------------------+-----+-----+
|           (n + 4)|group|abool|
+------------------+-----+-----+
|3.2876093379494122|    z|false|
| 4.753766378659703|    x|false|
|3.9554969216619464|    z|false|
|  4.45181233874579|    y|false|
|5.3451017084510095|    z|false|
+------------------+-----+-----+
only showing top 5 rows



Subtract 5 from the n column and view the results.

In [26]:
n = df.n - 5

In [27]:
df.select(n, "group","abool").show(5)

+-------------------+-----+-----+
|            (n - 5)|group|abool|
+-------------------+-----+-----+
| -5.712390662050588|    z|false|
| -4.246233621340297|    x|false|
| -5.044503078338053|    z|false|
|  -4.54818766125421|    y|false|
|-3.6548982915489905|    z|false|
+-------------------+-----+-----+
only showing top 5 rows



Multiply the n column by 2. View the results along with the original numbers.

In [28]:
n_times_2 = df.n * 2

In [33]:
df.select('*', n_times_2.alias("n_times_2")).show(5)

+--------------------+-----+-----+--------------------+
|                   n|group|abool|           n_times_2|
+--------------------+-----+-----+--------------------+
|  -0.712390662050588|    z|false|  -1.424781324101176|
|   0.753766378659703|    x|false|   1.507532757319406|
|-0.04450307833805...|    z|false|-0.08900615667610691|
| 0.45181233874578974|    y|false|  0.9036246774915795|
|  1.3451017084510097|    z|false|  2.6902034169020195|
+--------------------+-----+-----+--------------------+
only showing top 5 rows



Add a new column named n2 that is the n value multiplied by -1. Show the first 4 rows of your dataframe. You should see the original n value as well as n2.

In [34]:
n_times_neg1 = df.n * -1

In [35]:
df.select("*", n_times_neg1.alias("n2")).show(4)

+--------------------+-----+-----+--------------------+
|                   n|group|abool|                  n2|
+--------------------+-----+-----+--------------------+
|  -0.712390662050588|    z|false|   0.712390662050588|
|   0.753766378659703|    x|false|  -0.753766378659703|
|-0.04450307833805...|    z|false|0.044503078338053455|
| 0.45181233874578974|    y|false|-0.45181233874578974|
+--------------------+-----+-----+--------------------+
only showing top 4 rows



Add a new column named n3 that is the n value squared. Show the first 5 rows of your dataframe. You should see n, n2, and n3.

In [37]:
n_squared = df.n * df.n

In [39]:
df.select("*", n_times_neg1.alias("n2"), n_squared.alias("n3")).show(5)

+--------------------+-----+-----+--------------------+--------------------+
|                   n|group|abool|                  n2|                  n3|
+--------------------+-----+-----+--------------------+--------------------+
|  -0.712390662050588|    z|false|   0.712390662050588|   0.507500455376875|
|   0.753766378659703|    x|false|  -0.753766378659703|  0.5681637535977627|
|-0.04450307833805...|    z|false|0.044503078338053455|0.001980523981562...|
| 0.45181233874578974|    y|false|-0.45181233874578974| 0.20413438944294027|
|  1.3451017084510097|    z|false| -1.3451017084510097|  1.8092986060778251|
+--------------------+-----+-----+--------------------+--------------------+
only showing top 5 rows



What happens when you run the code below?

In [47]:
grp_plus_abool = df.group + df.abool

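It creates a Column object that describes adding the group and abool columns. No error is raised at this point, because Spark does not analyze or evaluate the expression until it is used in a dataframe operation such as .select.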

What happens when you run the code below? What is the difference between this and the previous code sample?

In [45]:
df.printSchema()

root
 |-- n: double (nullable = true)
 |-- group: string (nullable = true)
 |-- abool: boolean (nullable = true)



In [44]:
df.select(df.group + df.abool).show(5)

AnalysisException: "cannot resolve '(CAST(`group` AS DOUBLE) + `abool`)' due to data type mismatch: differing types in '(CAST(`group` AS DOUBLE) + `abool`)' (double and boolean).;;\n'Project [(cast(group#49 as double) + abool#50) AS (group + abool)#362]\n+- LogicalRDD [n#48, group#49, abool#50], false\n"

In [48]:
df.select(grp_plus_abool).show(5)

AnalysisException: "cannot resolve '(CAST(`group` AS DOUBLE) + `abool`)' due to data type mismatch: differing types in '(CAST(`group` AS DOUBLE) + `abool`)' (double and boolean).;;\n'Project [(cast(group#49 as double) + abool#50) AS (group + abool)#363]\n+- LogicalRDD [n#48, group#49, abool#50], false\n"

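The above raises an AnalysisException because Spark cannot add a string column to a boolean column: it casts group to double, and addition is not defined between a double and a boolean. The difference from the previous code sample is that .select makes Spark analyze the expression, whereas building the column object alone does not.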

Try adding various other columns together. What are the results of combining the different data types?

In [46]:
df.select("n"+"abool")

AnalysisException: "cannot resolve '`nabool`' given input columns: [n, group, abool];;\n'Project ['nabool]\n+- LogicalRDD [n#48, group#49, abool#50], false\n"
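
This error is not about data types. `"n" + "abool"` is ordinary Python string concatenation, so Spark receives the single column name `nabool`, which does not exist. The plain-Python sketch below shows what is passed to `.select`; to add columns, the operands need to be Column objects (for example `df.n` and `df.abool`) rather than strings.

```python
# Python evaluates the argument before Spark ever sees it.
column_name = "n" + "abool"

# This is the name Spark is asked to look up in [n, group, abool].
print(column_name)
```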