1. Write an SQL query to find the second highest salary from an employee table.

2. How do you handle NULL values in SQL joins?

3. Write a Python script to read a CSV file and load it into a DataFrame.

4. How do you handle exceptions in Python using try-except blocks?

5. In PySpark, how would you perform a join operation between two large DataFrames efficiently?

6. Write PySpark code to find the top 3 customers with the highest revenue per region.

7. What is the difference between partitioning and bucketing in PySpark?

8. How do you implement Slowly Changing Dimensions (SCD) in a data warehouse?

9. Explain the concept of star schema and snowflake schema in data modeling.

10. How would you design a fact table for an e-commerce platform?

11. How do you build an ETL pipeline using Azure Data Factory?

12. What are the different types of triggers in ADF and when to use them?

13. Explain the architecture of Azure Databricks and its integration with Delta Lake.

14. Write PySpark code to process streaming data from Event Hub in Databricks.

15. How do you optimize query performance in Azure Synapse Analytics?

16. How would you design a data warehouse for a retail business using Synapse?

17. What are the best practices for securing data in Azure Data Lake Storage?

18. How do you manage access control and secrets using Azure Key Vault?

19. Write a PySpark script to load data from ADLS into a Delta table.

20. How do you implement data lineage and governance in Microsoft Purview?

21. Build a real-time analytics pipeline using Event Hub, Stream Analytics, and Synapse.

22. How would you handle late-arriving data in a batch ETL pipeline?

23. Write an SQL query to calculate the customer churn rate over the last 6 months.

24. How do you implement incremental data loading in ADF pipelines?

25. Write a Python script to validate data quality and detect anomalies.

In [None]:
--Write an SQL query to find the second highest salary from an employee table.

with salary_ranking as (
    Select EmployeeId, Salary, DENSE_RANK() OVER ( ORDER BY Salary DESC ) as salary_rank
    from Employee
)
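-- MAX collapses ties at rank 2 into one row and yields NULL when no second salary exists,
-- so COALESCE can then supply the 0 default (COALESCE on the bare column returns no row at all)
Select COALESCE(MAX(Salary), 0) as SecondHighestSalary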
from salary_ranking
where salary_rank = 2
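
The query can be checked against a few edge cases (ties and a table with only one distinct salary). The following is a minimal sketch that runs it on an in-memory SQLite database through Python's `sqlite3` module; the `second_highest` helper and the sample salaries are made up for illustration, and it assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

QUERY = """
WITH salary_ranking AS (
    SELECT Salary, DENSE_RANK() OVER (ORDER BY Salary DESC) AS salary_rank
    FROM Employee
)
SELECT COALESCE(MAX(Salary), 0) AS SecondHighestSalary
FROM salary_ranking
WHERE salary_rank = 2
"""


def second_highest(salaries):
    """Load the salaries into a throwaway in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, Salary INTEGER)"
        )
        conn.executemany(
            "INSERT INTO Employee (Salary) VALUES (?)", [(s,) for s in salaries]
        )
        # The aggregate always returns exactly one row
        return conn.execute(QUERY).fetchone()[0]
    finally:
        conn.close()


print(second_highest([100, 300, 300, 200, 200]))  # ties at rank 1 and rank 2
print(second_highest([100]))                      # no second salary, falls back to 0
```

Because `DENSE_RANK` leaves no gaps after ties, rank 2 is always the second distinct salary, which is why it is preferred here over `RANK` or `ROW_NUMBER`.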

In [None]:
--How do you handle NULL values in SQL joins?

-- To exclude NULL keys: a plain inner join never matches NULL to NULL, because NULL = NULL is not true

SELECT * 
FROM table_a as a join table_b as b on a.col = b.col

-- To treat NULL keys as equal to each other, add an explicit IS NULL check

SELECT * 
FROM table_a as a join table_b as b on ( a.col = b.col or (a.col is null and b.col is null) ) 

-- To find rows of table_a with no match in table_b (including rows whose key is NULL),
-- use a left join and keep only the unmatched rows

SELECT * 
FROM table_a as a left join table_b as b on a.col = b.col 
where b.col is null
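
To see what each of the three joins returns, here is a small sketch that runs them on an in-memory SQLite database from Python. The sample rows are invented for illustration; the table and column names follow the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table_a (col TEXT);
    CREATE TABLE table_b (col TEXT);
    INSERT INTO table_a VALUES ('x'), ('y'), (NULL);
    INSERT INTO table_b VALUES ('x'), (NULL);
    """
)


def rows(sql):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Plain inner join: NULL keys never match each other
inner_rows = rows(
    "SELECT a.col, b.col FROM table_a AS a JOIN table_b AS b ON a.col = b.col"
)

# Inner join with an explicit NULL check: NULL keys are paired up
null_safe_rows = rows(
    "SELECT a.col, b.col FROM table_a AS a JOIN table_b AS b "
    "ON (a.col = b.col OR (a.col IS NULL AND b.col IS NULL))"
)

# Left join filtered to unmatched rows: 'y' and the NULL-keyed row of table_a
unmatched_rows = rows(
    "SELECT a.col FROM table_a AS a LEFT JOIN table_b AS b ON a.col = b.col "
    "WHERE b.col IS NULL"
)

print(len(inner_rows), len(null_safe_rows), len(unmatched_rows))
```

Some engines also provide a null-safe comparison, for example `IS NOT DISTINCT FROM` in PostgreSQL, which expresses the second form more compactly.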

In [None]:
#Write a Python script to read a CSV file and load it into a DataFrame.

import pandas as pd

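def read_file(path: str) -> pd.DataFrame:
    """Read a comma-separated file into a pandas DataFrame."""
    df = pd.read_csv(path, sep=",")
    # Return the loaded DataFrame, not the pandas module
    return df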

if __name__ == "__main__":
    filepath = input("Enter File Path : ").strip()
    pandas_df = read_file(filepath)
    print(pandas_df.head(10))

In [None]:
#How do you handle exceptions in Python using try-except blocks?

if __name__ == "__main__":
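    value1 = input("Insert First Value : ")
    value2 = input("Insert Next Value : ")
    try:
        # input() returns strings, so convert before dividing;
        # float() raises ValueError on non-numeric text
        print(float(value1) / float(value2))
    except ValueError as e:
        print("Invalid number format : ", e)
    except ZeroDivisionError as e:
        print("Zero division error : ", e)
    except Exception as e:
        # Fallback for anything not handled above
        print("Unexpected error : ", e)
    finally:
        # Runs whether or not an exception was raised
        print("done")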

- In PySpark, how would you perform a join operation between two large DataFrames efficiently?

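    - For two large DataFrames Spark defaults to a sort-merge join, which is expensive because both sides are shuffled and sorted
    - First check whether the join key is skewed in either DataFrame
    - If it is, salt the key: add a random salt to the skewed side, replicate the other side once per salt value, and join on the salted key
    - Increase the number of shuffle partitions so each task handles less data
    - Increase parallelism so the extra partitions are processed concurrently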
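
Why salting helps can be shown without a cluster. The sketch below is a plain-Python simulation, not Spark code: partitions are modeled as `crc32(key) % NUM_PARTITIONS` (Spark's own hash partitioner differs), and the key names, row counts, and constants are all made up for illustration. It compares the largest partition before and after salting and checks that the join result is unchanged.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8   # assumed number of shuffle partitions
NUM_SALTS = 10       # assumed number of salt values


def partition_of(key: str) -> int:
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS


def largest_partition(keys) -> int:
    """Row count of the most loaded partition for the given keys."""
    return max(Counter(partition_of(k) for k in keys).values())


rng = random.Random(0)

# Left side: one hot key holds 90% of the rows
left_keys = ["hot"] * 9_000 + [f"k{i}" for i in range(1_000)]
# Right side: one row per key
right = {"hot": "H", **{f"k{i}": f"v{i}" for i in range(1_000)}}

# Without salting, every "hot" row lands in the same partition
plain_max = largest_partition(left_keys)

# Salt the skewed side: append a random salt to each row's key
salted_left = [(k, f"{k}_{rng.randrange(NUM_SALTS)}") for k in left_keys]
# Replicate the other side once per salt value so every salted key finds its match
salted_right = {f"{k}_{s}": v for k, v in right.items() for s in range(NUM_SALTS)}

salted_max = largest_partition(sk for _, sk in salted_left)

# The join output is the same with or without salting
plain_join = [(k, right[k]) for k in left_keys]
salted_join = [(k, salted_right[sk]) for k, sk in salted_left]

print(f"largest partition without salting: {plain_max}")
print(f"largest partition with salting:    {salted_max}")
print(f"join results identical:            {plain_join == salted_join}")
```

The cost of salting is visible in `salted_right`, which holds `NUM_SALTS` times as many rows as `right`, so the replicated side should be the smaller of the two.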

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("SkewedJoin").getOrCreate()

# 1. Enable AQE and its skew-join handling first (this splits many skewed partitions automatically)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
spark.conf.set("spark.sql.shuffle.partitions", "1000")  # Tune based on cluster size

# 2. Load tables
large_df1 = spark.read.format("delta").table("delta_table_1")
large_df2 = spark.read.format("delta").table("delta_table_2")

# 3. Detect skew before choosing a join strategy
def detect_skew(df, key_col, threshold=10):
    """Return True if the most frequent key holds more than `threshold` percent of the rows."""
    skew_stats = df.groupBy(key_col).count().orderBy("count", ascending=False).limit(20)
    skew_stats.show()
    total_rows = df.count()
    # Use F.max explicitly so the Spark aggregate is not confused with Python's built-in max
    max_rows = skew_stats.agg(F.max("count")).collect()[0][0]
    skew_ratio = max_rows * 100.0 / total_rows if total_rows > 0 else 0
    print(f"Max skew ratio for {key_col}: {skew_ratio:.2f}%")
    return skew_ratio > threshold

skew1 = detect_skew(large_df1, "join_key")
skew2 = detect_skew(large_df2, "join_key")

# 4. Optional: cache only if the inputs fit in memory and are reused; caching is lazy and costs executor memory
large_df1.cache()
large_df2.cache()

if skew1 or skew2:
    print("Applying salting due to skew...")
    
    # 5. Salting: spread each key across N salt values (here N = 10, salts "0" to "9")
    num_salts = 10
    salts = spark.range(0, num_salts).select(col("id").cast("string").alias("salt"))
    
    # Replicate every row of large_df2 (ideally the smaller side) once per salt value,
    # then drop its join_key and salt so the joined result has no duplicate columns
    large_df2_salted = large_df2.crossJoin(salts)\
        .withColumn("salted_key", concat(col("join_key"), lit("_"), col("salt")))\
        .drop("join_key", "salt")
    
    # Give each row of large_df1 a random salt in the same 0-9 range,
    # so every salted key has a counterpart on the replicated side
    large_df1_salted = large_df1.withColumn("salt", floor(rand() * num_salts).cast("string"))\
        .withColumn("salted_key", concat(col("join_key"), lit("_"), col("salt")))
    
    # 6. Repartition (AQE may handle this too)
    large_df1_salted = large_df1_salted.repartition(1000, "salted_key")
    large_df2_salted = large_df2_salted.repartition(1000, "salted_key")
    
    # 7. Join on salted key
    joined_df = large_df1_salted.join(
        large_df2_salted, 
        on="salted_key", 
        how="inner"
    ).drop("salted_key", "salt")
    
else:
    # No skew - direct join (AQE + broadcast if one side small)
    joined_df = large_df1.join(large_df2, "join_key", "inner")

# Verify & cleanup
joined_df.explain()  # Check physical plan
print(f"Join completed. Rows: {joined_df.count()}")
large_df1.unpersist()
large_df2.unpersist()
