## Drop column from Delta Lake table

This notebook demonstrates how to drop a column from a Delta Lake table.

It shows how column mapping, which was added in Delta 1.2, makes this operation far more efficient: the column is dropped in the table metadata, so the existing Parquet data files do not have to be rewritten.
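To make the idea concrete, here is a minimal pure-Python sketch of name-based column mapping. It is a toy model, not Delta Lake's implementation: `ToyTable`, its methods, and the `col-<uuid>` naming are hypothetical and only illustrate why a drop can be a metadata-only change.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table that uses name-based column mapping."""

    # logical column name -> physical column name used inside the data files
    mapping: dict = field(default_factory=dict)
    # each "data file" is a dict keyed by physical column name
    files: list = field(default_factory=list)

    def add_column(self, logical_name):
        # Illustrative physical name; the exact format is an assumption.
        self.mapping[logical_name] = f"col-{uuid.uuid4()}"

    def append(self, row):
        # Data is stored under physical names, never under logical names.
        self.files.append({self.mapping[k]: v for k, v in row.items()})

    def drop_column(self, logical_name):
        # Metadata-only: forget the mapping entry, leave the files untouched.
        del self.mapping[logical_name]

    def read(self):
        # Readers project only the physical columns that are still mapped.
        return [
            {name: f.get(phys) for name, phys in self.mapping.items()}
            for f in self.files
        ]


table = ToyTable()
table.add_column("language")
table.add_column("num_speakers")
table.append({"language": "English", "num_speakers": "1.5"})
table.drop_column("language")

print(table.read())         # [{'num_speakers': '1.5'}]
print(len(table.files[0]))  # 2, the stored "file" still holds both columns
```

The dropped column's data stays in the file but becomes invisible to readers, which is why no rewrite is needed.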

In [1]:
import pyspark
from delta import configure_spark_with_delta_pip

In [2]:
builder = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

In [3]:
spark = configure_spark_with_delta_pip(builder).getOrCreate()

:: loading settings :: url = jar:file:/Users/matthew.powers/opt/miniconda3/envs/pyspark-330-delta-210/lib/python3.9/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /Users/matthew.powers/.ivy2/cache
The jars for the packages stored in: /Users/matthew.powers/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e82e1892-0ec6-4af6-bcfb-625cf1a896e4;1.0
	confs: [default]
	found io.delta#delta-core_2.12;2.1.0 in central
	found io.delta#delta-storage;2.1.0 in central
	found org.antlr#antlr4-runtime;4.8 in central
	found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
:: resolution report :: resolve 315ms :: artifacts dl 39ms
	:: modules in use:
	io.delta#delta-core_2.12;2.1.0 from central in [default]
	io.delta#delta-storage;2.1.0 from central in [default]
	org.antlr#antlr4-runtime;4.8 from central in [default]
	org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]

22/09/13 11:00:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


## Create Delta Lake table

In [4]:
spark.sql("drop table if exists `my_cool_table`")

DataFrame[]

In [5]:
columns = ["language", "num_speakers"]
data = [("English", "1.5"), ("Mandarin", "1.1"), ("Hindi", "0.6")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)

                                                                                

In [6]:
df.show()

+--------+------------+
|language|num_speakers|
+--------+------------+
| English|         1.5|
|Mandarin|         1.1|
|   Hindi|         0.6|
+--------+------------+



In [7]:
df.write.format("delta").saveAsTable("default.my_cool_table")

                                                                                

In [9]:
spark.sql("select * from `my_cool_table` WHERE num_speakers > 1.0").show()

+--------+------------+
|language|num_speakers|
+--------+------------+
|Mandarin|         1.1|
| English|         1.5|
+--------+------------+



In [None]:
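# load() needs the table's storage path (the same directory listed with tree below)
df = spark.read.format("delta").load("./spark-warehouse/my_cool_table")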

In [9]:
!tree ./spark-warehouse/my_cool_table/

./spark-warehouse/my_cool_table/
├── _delta_log
│   └── 00000000000000000000.json
├── part-00000-bf435d9b-669a-46cd-98b8-514b1432b94e-c000.snappy.parquet
├── part-00003-52618118-4e11-46a4-9c6c-1964344daea4-c000.snappy.parquet
├── part-00006-29198c40-d614-4afc-918c-6d309936bb9c-c000.snappy.parquet
└── part-00009-1159b044-5ec2-420d-babc-e92d7dcedf41-c000.snappy.parquet

1 directory, 5 files


In [None]:
spark.sql("select * from `my_cool_table`").printSchema()

## Drop column from Delta Lake table

In [None]:
# Name-based column mapping requires reader version 2 and writer version 5
spark.sql(
    """
    ALTER TABLE `my_cool_table` SET TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5'
    )
    """
)

In [None]:
# With column mapping enabled, this only updates the table metadata
spark.sql("alter table `my_cool_table` drop column language")

In [None]:
spark.sql("select * from `my_cool_table`").show()

In [None]:
!tree ./spark-warehouse/my_cool_table/

In [None]:
spark.sql("select * from `my_cool_table`").printSchema()
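One way to check that the drop was metadata-only is to look at which actions each commit in `_delta_log` recorded. The sketch below uses only the standard library and assumes each non-empty line of a commit file is a JSON object with a single top-level key naming the action (for example `add`, `remove`, `metaData`, `commitInfo`). The helper name `summarize_delta_log` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_delta_log(table_path):
    """Return {commit file name: Counter of action types} for a table directory."""
    summary = {}
    log_dir = Path(table_path) / "_delta_log"
    # Commit files are zero-padded, so sorting by name sorts by version.
    for commit in sorted(log_dir.glob("*.json")):
        actions = Counter()
        for line in commit.read_text().splitlines():
            if line.strip():
                # Count the top-level key(s) of each JSON action line.
                actions.update(json.loads(line).keys())
        summary[commit.name] = actions
    return summary
```

Calling `summarize_delta_log("./spark-warehouse/my_cool_table")` after the cells above should list one entry per commit. If the drop really touched only metadata, the commit for the `DROP COLUMN` would be expected to contain no `add` or `remove` actions.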

## Drop column from Delta Lake table before Delta 1.2

In [None]:
spark.sql("drop table if exists `another_cool_table`")

In [None]:
columns = ["language", "num_speakers"]
data = [("Spanish", "0.5"), ("French", "0.3"), ("Arabic", "0.3")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)

In [None]:
df.write.format("delta").saveAsTable("default.another_cool_table")

In [None]:
df = spark.sql("select * from another_cool_table")

In [None]:
df.show()

In [None]:
%ls -l ./spark-warehouse/another_cool_table/

In [None]:
df = df.drop("num_speakers")

In [None]:
df.show()

In [None]:
# Without column mapping, the whole table is rewritten without the dropped column
df.write.format("delta").mode("overwrite").option(
    "overwriteSchema", "true"
).saveAsTable("default.another_cool_table")

In [None]:
spark.sql("select * from another_cool_table").show()

In [None]:
%ls -l ./spark-warehouse/another_cool_table/
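Comparing the two directory listings by eye gets tedious, so here is a small standard-library sketch that diffs two snapshots of a table directory. The helper names `data_files` and `new_data_files` are hypothetical, and the sketch assumes data files end in `.parquet` and that anything under `_delta_log` is log data rather than table data.

```python
from pathlib import Path


def data_files(table_path):
    """Return relative paths of Parquet data files, skipping the _delta_log directory."""
    root = Path(table_path)
    return {
        p.relative_to(root).as_posix()
        for p in root.rglob("*.parquet")
        if "_delta_log" not in p.relative_to(root).parts
    }


def new_data_files(before, after):
    """Files present in the `after` snapshot but missing from `before`, sorted by name."""
    return sorted(after - before)
```

Take `before = data_files("./spark-warehouse/another_cool_table")` ahead of the overwrite and `after` once it finishes. With the overwrite approach, `new_data_files(before, after)` would be expected to list freshly written files, whereas the same check around the column-mapping `DROP COLUMN` should come back empty.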

## Cleanup

In [None]:
spark.sql("drop table if exists `my_cool_table`")

In [None]:
spark.sql("drop table if exists `another_cool_table`")