## Drop column from Delta Lake table

This notebook demonstrates how to drop a column from a Delta Lake table.

It also shows how column mapping, added in Delta Lake 1.2, makes this operation much more efficient: the column is dropped with a metadata-only change instead of a rewrite of the table's data files.
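
To make the efficiency claim concrete, here is a toy, pure-Python model of name-based column mapping. It is only a sketch of the idea, not Delta Lake's implementation: the `ToyTable` class, its fields, and the `col-1`-style physical names are all hypothetical. It illustrates that when data files are keyed by physical column names, dropping a logical column only edits the mapping in the table metadata and leaves the files untouched.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table with name-based column mapping."""

    # Metadata: logical column name -> physical column name used in data files.
    mapping: dict = field(default_factory=dict)
    # "Data files": each file is a list of rows keyed by physical column name.
    files: list = field(default_factory=list)

    def drop_column(self, logical_name):
        # Metadata-only change: forget the mapping, never touch the files.
        del self.mapping[logical_name]

    def read(self):
        # Readers project only the physical columns that are still mapped.
        return [
            {logical: row[physical] for logical, physical in self.mapping.items()}
            for data_file in self.files
            for row in data_file
        ]


table = ToyTable(
    mapping={"language": "col-1", "num_speakers": "col-2"},
    files=[
        [{"col-1": "English", "col-2": "1.5"}],
        [{"col-1": "Hindi", "col-2": "0.6"}],
    ],
)
table.drop_column("language")
rows = table.read()
```

After the drop, `rows` no longer exposes `language`, while the values stored under `col-1` are still present in `table.files`.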

In [1]:
import pyspark
from delta import configure_spark_with_delta_pip

In [2]:
# Register Delta Lake's SQL extension and catalog with the Spark session.
builder = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

In [3]:
spark = configure_spark_with_delta_pip(builder).getOrCreate()

:: loading settings :: url = jar:file:/Users/matthew.powers/opt/miniconda3/envs/pyspark-330-delta-210/lib/python3.9/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /Users/matthew.powers/.ivy2/cache
The jars for the packages stored in: /Users/matthew.powers/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-583f79c3-ad4d-46a1-aaab-e14079a498ae;1.0
	confs: [default]
	found io.delta#delta-core_2.12;2.1.0 in central
	found io.delta#delta-storage;2.1.0 in central
	found org.antlr#antlr4-runtime;4.8 in central
	found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
:: resolution report :: resolve 331ms :: artifacts dl 20ms
	:: modules in use:
	io.delta#delta-core_2.12;2.1.0 from central in [default]
	io.delta#delta-storage;2.1.0 from central in [default]
	org.antlr#antlr4-runtime;4.8 from central in [default]
	org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
22/09/13 10:31:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


## Create Delta Lake table

In [4]:
spark.sql("drop table if exists `my_cool_table`")

DataFrame[]

In [5]:
columns = ["language", "num_speakers"]
data = [("English", "1.5"), ("Mandarin", "1.1"), ("Hindi", "0.6")]
# Build the DataFrame directly; no intermediate RDD is needed.
df = spark.createDataFrame(data, columns)

In [6]:
df.show()

+--------+------------+
|language|num_speakers|
+--------+------------+
| English|         1.5|
|Mandarin|         1.1|
|   Hindi|         0.6|
+--------+------------+



In [7]:
df.write.format("delta").saveAsTable("default.my_cool_table")

22/09/13 10:32:25 ERROR Utils: Aborting task
org.apache.spark.sql.delta.DeltaAnalysisException: Cannot create table ('`default`.`my_cool_table`'). The associated location ('file:/Users/matthew.powers/Documents/code/my_apps/delta-examples/notebooks/pyspark/spark-warehouse/my_cool_table') is not empty but it's not a Delta table
	at org.apache.spark.sql.delta.DeltaErrorsBase.createTableWithNonEmptyLocation(DeltaErrors.scala:2226)
	at org.apache.spark.sql.delta.DeltaErrorsBase.createTableWithNonEmptyLocation$(DeltaErrors.scala:2225)
	at org.apache.spark.sql.delta.DeltaErrors$.createTableWithNonEmptyLocation(DeltaErrors.scala:2293)
	at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand.assertPathEmpty(CreateDeltaTableCommand.scala:248)
	at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand.$anonfun$run$2(CreateDeltaTableCommand.scala:120)
	at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile(DeltaLogging.scala:139)
	at org.apache.spark.sql.delta.metering.

AnalysisException: Cannot create table ('`default`.`my_cool_table`'). The associated location ('file:/Users/matthew.powers/Documents/code/my_apps/delta-examples/notebooks/pyspark/spark-warehouse/my_cool_table') is not empty but it's not a Delta table
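
The write above fails because a non-Delta directory already exists at the table location, most likely left behind by an earlier run; `drop table if exists` only removes tables that the catalog knows about. One possible way to recover, sketched here under the assumption that the leftover files are disposable, is to delete the stale directory before writing again. The helper name `remove_stale_table_dir` is hypothetical.

```python
import shutil
from pathlib import Path


def remove_stale_table_dir(table_dir):
    """Delete table_dir if it exists and has no _delta_log; return True if deleted."""
    path = Path(table_dir)
    if not path.is_dir():
        return False  # nothing to clean up
    if (path / "_delta_log").is_dir():
        return False  # looks like a Delta table, so leave it alone
    shutil.rmtree(path)
    return True
```

For example, `remove_stale_table_dir("./spark-warehouse/my_cool_table")` could be run before retrying the `saveAsTable` call.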

In [None]:
spark.sql("select * from `my_cool_table`").show()

In [None]:
!tree ./spark-warehouse/my_cool_table/

In [None]:
spark.sql("select * from `my_cool_table`").printSchema()

## Drop column from Delta Lake table

In [None]:
spark.sql(
    """ALTER TABLE `my_cool_table` SET TBLPROPERTIES (
   'delta.columnMapping.mode' = 'name',
   'delta.minReaderVersion' = '2',
   'delta.minWriterVersion' = '5')"""
)

In [None]:
spark.sql("alter table `my_cool_table` drop column language")

In [None]:
spark.sql("select * from `my_cool_table`").show()

In [None]:
!tree ./spark-warehouse/my_cool_table/

In [None]:
spark.sql("select * from `my_cool_table`").printSchema()
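
The `tree` listings taken before and after the drop can also be compared programmatically. The sketch below assumes the table's data files are Parquet files stored under the table directory; the helper name `list_data_files` is hypothetical.

```python
from pathlib import Path


def list_data_files(table_dir):
    """Return the sorted relative paths of Parquet files under table_dir."""
    root = Path(table_dir)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.parquet"))
```

Capturing `list_data_files("./spark-warehouse/my_cool_table")` before and after the `drop column` statement and comparing the two lists is one way to check that a metadata-only drop rewrote no data files.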

## Drop column from Delta Lake table before Delta 1.2

In [None]:
spark.sql("drop table if exists `another_cool_table`")

In [None]:
columns = ["language", "num_speakers"]
data = [("Spanish", "0.5"), ("French", "0.3"), ("Arabic", "0.3")]
# Build the DataFrame directly; no intermediate RDD is needed.
df = spark.createDataFrame(data, columns)

In [None]:
df.write.format("delta").saveAsTable("default.another_cool_table")

In [None]:
df = spark.sql("select * from another_cool_table")

In [None]:
df.show()

In [None]:
%ls -l ./spark-warehouse/another_cool_table/

In [None]:
df = df.drop("num_speakers")

In [None]:
df.show()

In [None]:
# Overwrite the whole table with the new schema; all of the data is written again.
df.write.format("delta").mode("overwrite").option(
    "overwriteSchema", "true"
).saveAsTable("default.another_cool_table")

In [None]:
spark.sql("select * from another_cool_table").show()

In [None]:
%ls -l ./spark-warehouse/another_cool_table/

## Cleanup

In [None]:
spark.sql("drop table if exists `my_cool_table`")

In [None]:
spark.sql("drop table if exists `another_cool_table`")