# Unit 7: Other miscellaneous features
In this unit, we will review a few miscellaneous Hudi features, using the Copy-on-Write (CoW) table we created earlier.<br>



This unit takes about 5 minutes to complete.

### Initialize Spark Session

In [1]:
from pyspark.sql import SparkSession  # needed for SparkSession.builder below
from pyspark.sql.functions import lit
from functools import reduce
from pyspark.sql.types import LongType
import pyspark.sql.functions as F
from datetime import datetime

spark = SparkSession.builder \
  .appName("Hudi-Learning-Unit-07-PySpark") \
  .master("yarn") \
  .enableHiveSupport() \
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
  .getOrCreate()

spark

23/08/01 18:48:13 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### Variables

In [2]:
PROJECT_ID_OUTPUT=!gcloud config get-value core/project
PROJECT_ID=PROJECT_ID_OUTPUT[0]
PROJECT_NBR_OUTPUT=!gcloud projects describe $PROJECT_ID --format="value(projectNumber)"
PROJECT_NBR=PROJECT_NBR_OUTPUT[0]

TRIP_DATE='2021-05-25'
LOCATION="us-central1"
HUDI_COW_BASE_GCS_URI = f"gs://gaia_data_bucket-{PROJECT_NBR}/nyc-taxi-trips-hudi-cow"
DATAPROC_METASTORE_THRIFT_URI_LIST=!gcloud metastore services list --location $LOCATION | grep thrift | cut -d' ' -f11
DATAPROC_METASTORE_THRIFT_URI=DATAPROC_METASTORE_THRIFT_URI_LIST[0]

print(f"Project ID is {PROJECT_ID}")
print(f"Project number is {PROJECT_NBR}")
print(f"Project location is {LOCATION}")
print(f"Hudi base Cow table GCS URI is {HUDI_COW_BASE_GCS_URI}")
print(f"Dataproc Metastore Service thrift URI is {DATAPROC_METASTORE_THRIFT_URI}")
print(f"Trip date to be used for deletes is {TRIP_DATE}")

Project ID is apache-hudi-lab
Project number is 623600433888
Project location is us-central1
Hudi base Cow table GCS URI is gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow
Dataproc Metastore Service thrift URI is thrift://10.60.192.28:9080
Trip date to be used for deletes is 2021-05-25


## 1. Review "Insert Overwrite" feature

The "Insert Overwrite" operation can be faster than upsert for batch ETL jobs that recompute entire target partitions at once (as opposed to incrementally updating the target tables). This is because it bypasses the indexing, precombining, and other repartitioning steps of the upsert write path entirely.<br><br>

Read the documentation/sample at this link:<br>
https://hudi.apache.org/docs/quick-start-guide#insert-overwrite
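
As a rough illustration of the operation described above, the following sketch builds the write options that select `insert_overwrite` and wraps the DataFrame write in a helper. The table name and the column names (`trip_id`, `pickup_datetime`) are hypothetical placeholders, not values taken from this lab, and your table may need further options matching how it was created. Only `trip_date` appears elsewhere in this notebook, and it is assumed here to be the partition column.

```python
def insert_overwrite_options(table_name, record_key, partition_field, precombine_field):
    """Build Hudi write options that select the insert_overwrite operation."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "insert_overwrite",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


def insert_overwrite(df, base_path, options):
    """Write df so that the partitions it contains are overwritten in the table."""
    df.write.format("hudi").options(**options).mode("append").save(base_path)


# Hypothetical table and column names; substitute those of your CoW table.
opts = insert_overwrite_options(
    table_name="nyc_taxi_trips_hudi_cow",
    record_key="trip_id",
    partition_field="trip_date",
    precombine_field="pickup_datetime",
)
print(opts["hoodie.datasource.write.operation"])
```

In the notebook, a call such as `insert_overwrite(recomputed_df, HUDI_COW_BASE_GCS_URI, opts)` would then perform the write, where `recomputed_df` stands for a DataFrame holding the recomputed partitions.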

## 2. Review "Alter Table" commands

Exercises for you:<br>
1. Rename the CoW table.
2. Add a column.
3. Set table properties to keep only the past 5 commits.

Commands are at:<br>
https://hudi.apache.org/docs/quick-start-guide#alter-table
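
One possible way to approach these exercises is to build the SQL statements in Python and pass them to `spark.sql`. This is a sketch under assumptions: the table names and the new column are hypothetical, and `hoodie.keep.max.commits` is used only because the quick start guide's example sets it. Whether that property, and a value of 5, is valid for your table depends on its other archival and cleaning settings, so check the Hudi configuration reference.

```python
def rename_table_sql(old_name, new_name):
    """ALTER TABLE statement that renames a table."""
    return f"ALTER TABLE {old_name} RENAME TO {new_name}"


def add_column_sql(table, column, data_type):
    """ALTER TABLE statement that appends one column."""
    return f"ALTER TABLE {table} ADD COLUMNS ({column} {data_type})"


def set_tblproperty_sql(table, key, value):
    """ALTER TABLE statement that sets a single table property."""
    return f"ALTER TABLE {table} SET TBLPROPERTIES ('{key}' = '{value}')"


# Hypothetical names; replace with your database, table, and column.
statements = [
    rename_table_sql("nyc_taxi_trips_hudi_cow", "nyc_taxi_trips_hudi_cow_renamed"),
    add_column_sql("nyc_taxi_trips_hudi_cow_renamed", "review_note", "string"),
    set_tblproperty_sql("nyc_taxi_trips_hudi_cow_renamed", "hoodie.keep.max.commits", "5"),
]
for stmt in statements:
    print(stmt)
```

In the notebook, each statement can be executed with `spark.sql(stmt)`.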

## 3. Review partition commands

Exercises for you:<br>
1. Run a "show partitions" command
2. Drop the partition trip_date='2021-12-31'

Commands are at:<br>
https://hudi.apache.org/docs/quick-start-guide#partition-sql-command
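
The two partition exercises can be sketched the same way. The table name below is a hypothetical placeholder; the partition column and value come from the exercise text above.

```python
def show_partitions_sql(table):
    """Statement that lists the partitions of a table."""
    return f"SHOW PARTITIONS {table}"


def drop_partition_sql(table, column, value):
    """Statement that drops the single partition where column equals value."""
    return f"ALTER TABLE {table} DROP PARTITION ({column}='{value}')"


# Hypothetical table name; replace with the name of your CoW table.
TABLE = "nyc_taxi_trips_hudi_cow"
show_stmt = show_partitions_sql(TABLE)
drop_stmt = drop_partition_sql(TABLE, "trip_date", "2021-12-31")
print(show_stmt)
print(drop_stmt)
```

In the notebook, `spark.sql(show_stmt).show(truncate=False)` displays the partitions. Running `spark.sql(drop_stmt)` and then listing the partitions again is one way to check the result.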

This concludes the unit. Please proceed to the next notebook.