- Title: Rename and Drop Columns in Spark DataFrames
- Slug: spark-rename-columns
- Date: 2020-07-19 14:24:40
- Category: Computer Science
- Tags: programming, Scala, Spark, DataFrame, rename, column, drop
- Author: Ben Du

## Comment

You can use `withColumnRenamed` to rename a column in a DataFrame.
You can also rename columns with `alias` when selecting them.

In [2]:
interp.load.ivy("org.apache.spark" % "spark-core_2.12" % "3.0.0")
interp.load.ivy("org.apache.spark" % "spark-sql_2.12" % "3.0.0")

In [3]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
    .master("local[2]")
    .appName("Spark_DataFrame_Column")
    .getOrCreate()

import spark.implicits._


import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.functions._

spark: SparkSession = org.apache.spark.sql.SparkSession@234915a7
import spark.implicits._

In [4]:
val df = Seq(
    (1L, "a", "foo", 3.0),
    (2L, "b", "bar", 4.0),
    (3L, "c", "foo", 5.0),
    (4L, "d", "bar", 7.0)
).toDF
df.show



+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
|  1|  a|foo|3.0|
|  2|  b|bar|4.0|
|  3|  c|foo|5.0|
|  4|  d|bar|7.0|
+---+---+---+---+



df: org.apache.spark.sql.package.DataFrame = [_1: bigint, _2: string ... 2 more fields]

## Drop Columns

In [5]:
df.drop("_1", "_3").show



+---+---+
| _2| _4|
+---+---+
|  a|3.0|
|  b|4.0|
|  c|5.0|
|  d|7.0|
+---+---+
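
Besides a list of names, `drop` also accepts a `Column`, and a name that is not in the schema is ignored rather than raising an error. The sketch below illustrates both behaviors; the `demo` DataFrame and the name `missing_col` are made up for illustration and are not part of the original notebook.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Reuses the running session if there is one; otherwise starts a local one.
val spark = SparkSession.builder()
    .master("local[2]")
    .appName("drop_sketch")
    .getOrCreate()
import spark.implicits._

// Hypothetical data, shaped like the DataFrame used in this post.
val demo = Seq(
    (1L, "a", "foo", 3.0),
    (2L, "b", "bar", 4.0)
).toDF

// Drop by Column reference instead of by name.
val droppedByColumn = demo.drop(col("_1"))

// "missing_col" is not in the schema, so the result keeps all 4 columns.
val droppedMissing = demo.drop("missing_col")

println(droppedByColumn.columns.mkString(", "))  // _2, _3, _4
println(droppedMissing.columns.mkString(", "))   // _1, _2, _3, _4
```

Because a misspelled name is dropped silently, it may be worth comparing `columns` before and after when the list of names comes from user input.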



## Renaming One Column Using `withColumnRenamed`

In [8]:
df.withColumnRenamed("_1", "x1").show

+---+---+---+---+
| x1| _2| _3| _4|
+---+---+---+---+
|  1|  a|foo|3.0|
|  2|  b|bar|4.0|
|  3|  c|foo|5.0|
|  4|  d|bar|7.0|
+---+---+---+---+
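
`withColumnRenamed` returns the DataFrame unchanged when the existing name is not found, so a typo in the source column name goes unnoticed. A minimal sketch of a stricter wrapper follows; `renameStrict`, `demo`, and `no_such_col` are hypothetical names introduced here, not part of Spark or of the original notebook.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Reuses the running session if there is one; otherwise starts a local one.
val spark = SparkSession.builder()
    .master("local[2]")
    .appName("rename_strict_sketch")
    .getOrCreate()
import spark.implicits._

// Hypothetical data with the default tuple column names _1 and _2.
val demo = Seq((1L, "a"), (2L, "b")).toDF

// Hypothetical helper: fail fast when the source column is absent.
// Note that this check is an exact, case-sensitive match on the name.
def renameStrict(df: DataFrame, from: String, to: String): DataFrame = {
    require(
        df.columns.contains(from),
        s"column '$from' not found in: ${df.columns.mkString(", ")}"
    )
    df.withColumnRenamed(from, to)
}

// Plain withColumnRenamed: no error, schema unchanged.
val silent = demo.withColumnRenamed("no_such_col", "x1")

// Strict variant: renames when the column exists, throws otherwise.
val strict = renameStrict(demo, "_1", "x1")

println(silent.columns.mkString(", "))  // _1, _2
println(strict.columns.mkString(", "))  // x1, _2
```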



## Renaming One Column Using `alias`

In [9]:
df.select(
    $"_1".alias("x1"),
    $"_2",
    $"_3",
    $"_4"
).show

+---+---+---+---+
| x1| _2| _3| _4|
+---+---+---+---+
|  1|  a|foo|3.0|
|  2|  b|bar|4.0|
|  3|  c|foo|5.0|
|  4|  d|bar|7.0|
+---+---+---+---+
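
`alias` has a few equivalent spellings on `Column`, namely `as` and `name`, and the same rename can be written as a SQL expression with `selectExpr`. The sketch below shows them side by side on a small made-up DataFrame (`demo` is an assumption for illustration); all four should produce the same column names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Reuses the running session if there is one; otherwise starts a local one.
val spark = SparkSession.builder()
    .master("local[2]")
    .appName("alias_sketch")
    .getOrCreate()
import spark.implicits._

// Hypothetical data with the default tuple column names _1 and _2.
val demo = Seq((1L, "a"), (2L, "b")).toDF

// Four ways to rename _1 to x1 while selecting.
val viaAlias = demo.select($"_1".alias("x1"), $"_2")
val viaAs    = demo.select($"_1".as("x1"), $"_2")
val viaName  = demo.select(col("_1").name("x1"), col("_2"))
val viaExpr  = demo.selectExpr("_1 AS x1", "_2")

Seq(viaAlias, viaAs, viaName, viaExpr).foreach { d =>
    println(d.columns.mkString(", "))  // x1, _2
}
```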



## Batch Renaming Using `withColumnRenamed`

In [12]:
val lookup = Map(
    "_1" -> "x1",
    "_2" -> "x2",
    "_3" -> "x3",
    "_4" -> "x4"
)

In [13]:
lookup.foldLeft(df) {
    (acc, ca) => acc.withColumnRenamed(ca._1, ca._2)
}.show

+---+---+---+---+
| x1| x2| x3| x4|
+---+---+---+---+
|  1|  a|foo|3.0|
|  2|  b|bar|4.0|
|  3|  c|foo|5.0|
|  4|  d|bar|7.0|
+---+---+---+---+
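
The lookup map does not have to be written by hand. As a sketch, assuming every column follows the default `_N` naming, the map can be derived from `columns` and then folded over exactly as above; `demo` and `derivedLookup` are names made up for this example.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one.
val spark = SparkSession.builder()
    .master("local[2]")
    .appName("derived_lookup_sketch")
    .getOrCreate()
import spark.implicits._

// Hypothetical data, shaped like the DataFrame used in this post.
val demo = Seq(
    (1L, "a", "foo", 3.0),
    (2L, "b", "bar", 4.0)
).toDF

// Build old -> new names by swapping the leading underscore for "x".
val derivedLookup: Map[String, String] =
    demo.columns.map(c => c -> c.replaceFirst("^_", "x")).toMap

// Apply one withColumnRenamed per entry.
val renamed = derivedLookup.foldLeft(demo) {
    case (acc, (from, to)) => acc.withColumnRenamed(from, to)
}

println(renamed.columns.mkString(", "))  // x1, x2, x3, x4
```

Column order is preserved because each `withColumnRenamed` call renames in place, regardless of the order in which the map entries are visited.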



## Batch Renaming Using `alias`

In [14]:
df.select(df.columns.map(c => col(c).alias(lookup.getOrElse(c, c))): _*).show

+---+---+---+---+
| x1| x2| x3| x4|
+---+---+---+---+
|  1|  a|foo|3.0|
|  2|  b|bar|4.0|
|  3|  c|foo|5.0|
|  4|  d|bar|7.0|
+---+---+---+---+
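
When every column gets a new name, `toDF` with a full list of names is a shorter alternative to mapping over `columns`. The sketch below assumes a made-up `demo` DataFrame and a `newNames` list; the list must have exactly one entry per column, otherwise `toDF` throws.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one.
val spark = SparkSession.builder()
    .master("local[2]")
    .appName("todf_sketch")
    .getOrCreate()
import spark.implicits._

// Hypothetical data, shaped like the DataFrame used in this post.
val demo = Seq(
    (1L, "a", "foo", 3.0),
    (2L, "b", "bar", 4.0)
).toDF

// New names are assigned by position, one per existing column.
val newNames = Seq("x1", "x2", "x3", "x4")
val renamed = demo.toDF(newNames: _*)

println(renamed.columns.mkString(", "))  // x1, x2, x3, x4
```

Unlike the `lookup`-based approach, `toDF` is positional, so it fits best when the column order is known and stable.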



## References

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/functions.html

https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html