## Columns

* A column is a named, typed field of a `DataFrame`
* Columns are selectable and easy to configure
* Columns can be added and removed
* A column holds a simple type like an integer or string, a complex type like an array or map, or a null value
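
To make the simple and complex types above concrete, here is a minimal sketch that builds a tiny `DataFrame` by hand and prints its schema. The rows, the column names (`id`, `tags`, `attributes`), and the local `SparkSession` settings are all made up for illustration; in a notebook that already defines `spark`, the session line can be skipped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, IntegerType, MapType}

// Assumption: a local session; getOrCreate() reuses an existing one if present
val spark = SparkSession.builder().master("local[*]").appName("column-types-sketch").getOrCreate()

// Hypothetical rows: an integer id, an array of tags, and a map of attributes
val sampleDF = spark.createDataFrame(Seq(
  (1, Seq("fantasy", "children"), Map("format" -> "hardcover")),
  (2, Seq("science"), Map("format" -> "paperback"))
)).toDF("id", "tags", "attributes")

// Shows one simple column (integer) and two complex ones (array, map)
sampleDF.printSchema()
```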

## Bringing in the books dataset

In [32]:
val booksDF = spark.read.format("csv")
                     .option("inferSchema", "true")
                     .option("header", "true")
                     .load("../data/books.csv")

booksDF.printSchema()

root
 |-- bookID: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- average_rating: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- # num_pages: string (nullable = true)
 |-- ratings_count: integer (nullable = true)
 |-- text_reviews_count: integer (nullable = true)



booksDF: org.apache.spark.sql.DataFrame = [bookID: int, title: string ... 8 more fields]


## Representing `Column`

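* A column can be referenced with any of the following declarations
* The last two rely on Spark's implicits (`import spark.implicits._`, which `spark-shell` imports automatically); note that the `'name` symbol literal is deprecated as of Scala 2.13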

In [33]:
import org.apache.spark.sql.functions.{col, column}

col("someColumnName")
column("someColumnName")
$"someColumnName"
'someColumnName

import org.apache.spark.sql.functions.{col, column}
res27: Symbol = 'someColumnName


## Extracting column representation from `DataFrame`

* We can also use `col` from the `DataFrame` to select one column

In [34]:
booksDF.col("average_rating")

res28: org.apache.spark.sql.Column = average_rating
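
One practical difference between the two forms is when a bad name is caught. The sketch below assumes a classic (non-Connect) Spark session, where `DataFrame.col` resolves the name against that `DataFrame` right away, while the free-standing `col` is only checked once it is used in a query. The tiny `DataFrame` and the name `no_such_column` are hypothetical.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session; getOrCreate() reuses an existing one if present
val spark = SparkSession.builder().master("local[*]").appName("df-col-sketch").getOrCreate()

// Hypothetical stand-in for a real DataFrame
val tinyDF = spark.createDataFrame(Seq((1, "a"))).toDF("id", "label")

// The free-standing form is not tied to any DataFrame, so nothing is checked yet
val unresolved = col("no_such_column")

// The DataFrame-bound form looks the name up immediately and fails if it is missing
val bound = Try(tinyDF.col("no_such_column"))
println(s"DataFrame.col succeeded: ${bound.isSuccess}")
```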


## Selecting columns to be displayed with `select`

* Select one or more columns using `select`
* You can use whatever column form you please

In [19]:
val subset = booksDF.select($"ratings_count", col("title"), 'authors)
subset.show()

+-------------+--------------------+--------------------+
|ratings_count|               title|             authors|
+-------------+--------------------+--------------------+
|      1944099|Harry Potter and ...|J.K. Rowling-Mary...|
|      1996446|Harry Potter and ...|J.K. Rowling-Mary...|
|      5629932|Harry Potter and ...|J.K. Rowling-Mary...|
|         6267|Harry Potter and ...|        J.K. Rowling|
|      2149872|Harry Potter and ...|J.K. Rowling-Mary...|
|        38872|Harry Potter Boxe...|J.K. Rowling-Mary...|
|           18|Unauthorized Harr...|W. Frederick Zimm...|
|        27410|Harry Potter Coll...|        J.K. Rowling|
|         3602|The Ultimate Hitc...|       Douglas Adams|
|       240189|The Ultimate Hitc...|       Douglas Adams|
|         4416|The Hitchhiker's ...|       Douglas Adams|
|         1222|The Hitchhiker's ...|Douglas Adams-Ste...|
|         2801|The Ultimate Hitc...|       Douglas Adams|
|       228522|A Short History o...|Bill Bryson-Willi...|
|         6993

subset: org.apache.spark.sql.DataFrame = [ratings_count: int, title: string ... 1 more field]

* A selected column can be given an alias with `as`; the result uses the new name, as the next cell shows with `num_ratings`


In [20]:
val subset = booksDF.select($"ratings_count".as("num_ratings"), col("title"), 'authors)
subset.show()

+-----------+--------------------+--------------------+
|num_ratings|               title|             authors|
+-----------+--------------------+--------------------+
|    1944099|Harry Potter and ...|J.K. Rowling-Mary...|
|    1996446|Harry Potter and ...|J.K. Rowling-Mary...|
|    5629932|Harry Potter and ...|J.K. Rowling-Mary...|
|       6267|Harry Potter and ...|        J.K. Rowling|
|    2149872|Harry Potter and ...|J.K. Rowling-Mary...|
|      38872|Harry Potter Boxe...|J.K. Rowling-Mary...|
|         18|Unauthorized Harr...|W. Frederick Zimm...|
|      27410|Harry Potter Coll...|        J.K. Rowling|
|       3602|The Ultimate Hitc...|       Douglas Adams|
|     240189|The Ultimate Hitc...|       Douglas Adams|
|       4416|The Hitchhiker's ...|       Douglas Adams|
|       1222|The Hitchhiker's ...|Douglas Adams-Ste...|
|       2801|The Ultimate Hitc...|       Douglas Adams|
|     228522|A Short History o...|Bill Bryson-Willi...|
|       6993|Bill Bryson's Afr...|         Bill 

subset: org.apache.spark.sql.DataFrame = [num_ratings: int, title: string ... 1 more field]


In [21]:
subset.printSchema()

root
 |-- num_ratings: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- authors: string (nullable = true)



## Programmatically accessing columns

* We can access the columns programmatically using `columns` from the `DataFrame`
* This returns an `Array[String]`

In [22]:
booksDF.columns

res20: Array[String] = Array(bookID, title, authors, average_rating, isbn, isbn13, language_code, # num_pages, ratings_count, text_reviews_count)
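
Because `columns` is a plain `Array[String]`, ordinary collection methods can drive a `select`. The following is a sketch only: the one-row `DataFrame` is a hypothetical stand-in for `booksDF`, and the `_count` suffix filter is just one possible rule.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session; getOrCreate() reuses an existing one if present
val spark = SparkSession.builder().master("local[*]").appName("columns-sketch").getOrCreate()

// Hypothetical one-row stand-in with a few of the books column names
val miniDF = spark.createDataFrame(Seq((1, "Some Title", 10, 2)))
  .toDF("bookID", "title", "ratings_count", "text_reviews_count")

// Keep only the columns whose name ends in "_count"
val countColumns = miniDF.columns.filter(_.endsWith("_count")).map(name => col(name))
val countsDF = miniDF.select(countColumns: _*)
countsDF.printSchema()
```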


## Renaming a column

* `withColumnRenamed` renames a column: the first parameter is the existing column name and the second is the new name
* This returns a new `DataFrame`

In [41]:
val booksRenamedDF = booksDF.withColumnRenamed("text_reviews_count", "reviews_count")
booksRenamedDF.printSchema()

root
 |-- bookID: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- average_rating: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- # num_pages: string (nullable = true)
 |-- ratings_count: integer (nullable = true)
 |-- reviews_count: integer (nullable = true)



booksRenamedDF: org.apache.spark.sql.DataFrame = [bookID: int, title: string ... 8 more fields]
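
One thing to watch for: `withColumnRenamed` is a no-op when the existing name is not in the schema, so a typo goes unnoticed instead of raising an error. The sketch below uses a hypothetical two-column `DataFrame` and a deliberately misspelled name to show this.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; getOrCreate() reuses an existing one if present
val spark = SparkSession.builder().master("local[*]").appName("rename-sketch").getOrCreate()

// Hypothetical stand-in for booksDF
val reviewsDF = spark.createDataFrame(Seq((1, 5))).toDF("bookID", "text_reviews_count")

// Misspelled on purpose ("review" instead of "reviews"): no error, nothing renamed
val typoDF = reviewsDF.withColumnRenamed("text_review_count", "reviews_count")

// Correct name: the column is renamed
val renamedDF = reviewsDF.withColumnRenamed("text_reviews_count", "reviews_count")

println(typoDF.columns.mkString(", "))
println(renamedDF.columns.mkString(", "))
```

Checking `printSchema()` or `columns` after a rename, as the cell above does, is a cheap way to confirm it took effect.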


## Creating a new column based on another

* A common way to engineer columns is to derive a new column from an existing one with `withColumn`

In this example, the `# num_pages` column from `booksDF` was read as a `StringType`. We use `cast` to create a new column with a cleaner name, `num_pages`, typed as `IntegerType`.

In [42]:
import org.apache.spark.sql.types.IntegerType
val convertedDF = booksRenamedDF.withColumn("num_pages", col("# num_pages").cast(IntegerType))
convertedDF.printSchema()

root
 |-- bookID: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- average_rating: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- # num_pages: string (nullable = true)
 |-- ratings_count: integer (nullable = true)
 |-- reviews_count: integer (nullable = true)
 |-- num_pages: integer (nullable = true)



import org.apache.spark.sql.types.IntegerType
convertedDF: org.apache.spark.sql.DataFrame = [bookID: int, title: string ... 9 more fields]
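
A cast like this can hide bad data. With ANSI mode turned off (the default in Spark 3.x, as far as I know), a string that cannot be parsed as an integer becomes `null` rather than failing. The sketch below sets `spark.sql.ansi.enabled` to `false` explicitly so the behaviour is the same across versions; note that this changes the setting for the whole session. The rows, including the `n/a` value, are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Assumption: a local session; getOrCreate() reuses an existing one if present
val spark = SparkSession.builder().master("local[*]").appName("cast-sketch").getOrCreate()

// Assumption: non-ANSI casting, where unparsable strings become null (session-wide setting)
spark.conf.set("spark.sql.ansi.enabled", "false")

// Hypothetical rows: one clean page count and one that is not a number
val rawPagesDF = spark.createDataFrame(Seq(("1", "652"), ("2", "n/a"))).toDF("id", "# num_pages")
val castDF = rawPagesDF.withColumn("num_pages", col("# num_pages").cast(IntegerType))

// Collect into a Map of id -> optional page count (column index 2 is num_pages)
val pagesById = castDF.collect().map { row =>
  val pages = if (row.isNullAt(2)) None else Some(row.getInt(2))
  row.getString(0) -> pages
}.toMap

println(pagesById)
```

Counting the nulls in the new column after a cast is one way to find out how many values failed to convert.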


## Dropping the columns

* We can drop the columns using `drop`
* Since the table now has two columns that represent pages, we can drop the one we don't want

In [43]:
val finalDF = convertedDF.drop("# num_pages")
finalDF.printSchema()

root
 |-- bookID: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- average_rating: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- ratings_count: integer (nullable = true)
 |-- reviews_count: integer (nullable = true)
 |-- num_pages: integer (nullable = true)



finalDF: org.apache.spark.sql.DataFrame = [bookID: int, title: string ... 8 more fields]
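
`drop` also accepts several names at once, and, like `withColumnRenamed`, it silently does nothing for a name that is not in the schema. A minimal sketch, using a hypothetical three-column `DataFrame`:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; getOrCreate() reuses an existing one if present
val spark = SparkSession.builder().master("local[*]").appName("drop-sketch").getOrCreate()

// Hypothetical stand-in for booksDF
val isbnDF = spark.createDataFrame(Seq((1, "0439785960", "9780439785969")))
  .toDF("bookID", "isbn", "isbn13")

// Drop two columns in one call
val slimDF = isbnDF.drop("isbn", "isbn13")

// Dropping a name that does not exist leaves the DataFrame unchanged
val unchangedDF = isbnDF.drop("not_a_column")

println(slimDF.columns.mkString(", "))
println(unchangedDF.columns.mkString(", "))
```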


In [35]:
finalDF.show(10)

+------+--------------------+--------------------+--------------+----------+-------------+-------------+-------------+-------------+---------+
|bookID|               title|             authors|average_rating|      isbn|       isbn13|language_code|ratings_count|reviews_count|num_pages|
+------+--------------------+--------------------+--------------+----------+-------------+-------------+-------------+-------------+---------+
|     1|Harry Potter and ...|J.K. Rowling-Mary...|          4.56|0439785960|9780439785969|          eng|      1944099|        26249|      652|
|     2|Harry Potter and ...|J.K. Rowling-Mary...|          4.49|0439358078|9780439358071|          eng|      1996446|        27613|      870|
|     3|Harry Potter and ...|J.K. Rowling-Mary...|          4.47|0439554934|9780439554930|          eng|      5629932|        70390|      320|
|     4|Harry Potter and ...|        J.K. Rowling|          4.41|0439554896|9780439554893|          eng|        

## Bring it all together

In [45]:
val final2 = booksDF.withColumnRenamed("text_reviews_count", "reviews_count")
                    .withColumn("num_pages", col("# num_pages").cast(IntegerType))
                    .drop("# num_pages")
final2.printSchema()

root
 |-- bookID: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- average_rating: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- ratings_count: integer (nullable = true)
 |-- reviews_count: integer (nullable = true)
 |-- num_pages: integer (nullable = true)



final2: org.apache.spark.sql.DataFrame = [bookID: int, title: string ... 8 more fields]


In [46]:
final2.show(10)

+------+--------------------+--------------------+--------------+----------+-------------+-------------+-------------+-------------+---------+
|bookID|               title|             authors|average_rating|      isbn|       isbn13|language_code|ratings_count|reviews_count|num_pages|
+------+--------------------+--------------------+--------------+----------+-------------+-------------+-------------+-------------+---------+
|     1|Harry Potter and ...|J.K. Rowling-Mary...|          4.56|0439785960|9780439785969|          eng|      1944099|        26249|      652|
|     2|Harry Potter and ...|J.K. Rowling-Mary...|          4.49|0439358078|9780439358071|          eng|      1996446|        27613|      870|
|     3|Harry Potter and ...|J.K. Rowling-Mary...|          4.47|0439554934|9780439554930|          eng|      5629932|        70390|      320|
|     4|Harry Potter and ...|        J.K. Rowling|          4.41|0439554896|9780439554893|          eng|         6267|          272|      352|
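
As an alternative to chaining `withColumn` and `drop`, the cast and the new name can be expressed in a single `select` using `cast` and `as`. This is a sketch of that design choice, not a replacement for the chain above: `select` keeps only the columns you list, so every column you want must be named. The two-row `DataFrame` is a hypothetical stand-in for `booksDF`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Assumption: a local session; getOrCreate() reuses an existing one if present
val spark = SparkSession.builder().master("local[*]").appName("select-cast-sketch").getOrCreate()

// Hypothetical two-column stand-in for booksDF
val miniBooksDF = spark.createDataFrame(Seq((1, "652"), (2, "870"))).toDF("bookID", "# num_pages")

// Cast and rename in one projection; "# num_pages" is left out simply by not listing it
val projectedDF = miniBooksDF.select(
  col("bookID"),
  col("# num_pages").cast(IntegerType).as("num_pages")
)
projectedDF.printSchema()
```

The `withColumn` chain tends to read better when most columns are kept, while a `select` is more compact when only a few are needed.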

## Lab: Column Manipulation

**Step 1:** Start from the schema inferred by the provided `winesDF` load (`inferSchema` is already set for you)

**Step 2:** Print the schema

**Step 3:** Rename `_c0` to something better, like `id`

**Step 4:** Convert column `price` to `DoubleType`

**Step 5:** Convert column `points` to `IntegerType`

**Step 6:** Convert column `id` to `IntegerType`

**Step 7:** Print the schema

**Step 8:** Show the dataset

In [63]:
val winesDF = spark.read.format("csv")
                     .option("inferSchema", "true")
                     .option("header", "true")
                     .load("../data/winemag.csv")

winesDF: org.apache.spark.sql.DataFrame = [_c0: string, country: string ... 12 more fields]


In [64]:
import org.apache.spark.sql.types.DoubleType
val cleanWinesDF = winesDF
                     .withColumn("price", $"price".cast(DoubleType))
                     .withColumnRenamed("_c0", "id")
                     .withColumn("id", $"id".cast(IntegerType))
                     .withColumn("points", $"points".cast(IntegerType))
cleanWinesDF.printSchema()

root
 |-- id: integer (nullable = true)
 |-- country: string (nullable = true)
 |-- description: string (nullable = true)
 |-- designation: string (nullable = true)
 |-- points: integer (nullable = true)
 |-- price: double (nullable = true)
 |-- province: string (nullable = true)
 |-- region_1: string (nullable = true)
 |-- region_2: string (nullable = true)
 |-- taster_name: string (nullable = true)
 |-- taster_twitter_handle: string (nullable = true)
 |-- title: string (nullable = true)
 |-- variety: string (nullable = true)
 |-- winery: string (nullable = true)



import org.apache.spark.sql.types.DoubleType
cleanWinesDF: org.apache.spark.sql.DataFrame = [id: int, country: string ... 12 more fields]


In [62]:
cleanWinesDF.show(10)

+---+--------+--------------------+--------------------+------+-----+-----------------+-------------------+-----------------+------------------+---------------------+--------------------+------------------+-------------------+
| id| country|         description|         designation|points|price|         province|           region_1|         region_2|       taster_name|taster_twitter_handle|               title|           variety|             winery|
+---+--------+--------------------+--------------------+------+-----+-----------------+-------------------+-----------------+------------------+---------------------+--------------------+------------------+-------------------+
|  0|   Italy|Aromas include tr...|        Vulkà Bianco|    87| null|Sicily & Sardinia|               Etna|             null|     Kerin O’Keefe|         @kerinokeefe|Nicosia 2013 Vulk...|       White Blend|            Nicosia|
|  1|Portugal|This is ripe and ...|            Avidagos|    87| 15.0|            Douro|     