## 1. Create a SparkSession in PySpark.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('pysparkinterviewqs').getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/07/01 22:03:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
spark

## 2. Read a CSV file into a DataFrame using PySpark.

In [5]:
df = spark.read.csv('employees.csv')

In [6]:
df

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string, _c6: string, _c7: string]

In [7]:
df.show()

+----------+------+----------+---------------+------+-------+-----------------+--------------------+
|       _c0|   _c1|       _c2|            _c3|   _c4|    _c5|              _c6|                 _c7|
+----------+------+----------+---------------+------+-------+-----------------+--------------------+
|First Name|Gender|Start Date|Last Login Time|Salary|Bonus %|Senior Management|                Team|
|   Douglas|  Male|  8/6/1993|       12:42 PM| 97308|  6.945|             true|           Marketing|
|    Thomas|  Male| 3/31/1996|        6:53 AM| 61933|   4.17|             true|                NULL|
|     Maria|Female| 4/23/1993|       11:17 AM|130590| 11.858|            false|             Finance|
|     Jerry|  Male|  3/4/2005|        1:00 PM|138705|   9.34|             true|             Finance|
|     Larry|  Male| 1/24/1998|        4:47 PM|101004|  1.389|             true|     Client Services|
|    Dennis|  Male| 4/18/1987|        1:35 AM|115163| 10.125|            false|               Legal|

In [12]:
df = spark.read.option('header',True).csv('employees.csv')

In [13]:
df.show()

+----------+------+----------+---------------+------+-------+-----------------+--------------------+
|First Name|Gender|Start Date|Last Login Time|Salary|Bonus %|Senior Management|                Team|
+----------+------+----------+---------------+------+-------+-----------------+--------------------+
|   Douglas|  Male|  8/6/1993|       12:42 PM| 97308|  6.945|             true|           Marketing|
|    Thomas|  Male| 3/31/1996|        6:53 AM| 61933|   4.17|             true|                NULL|
|     Maria|Female| 4/23/1993|       11:17 AM|130590| 11.858|            false|             Finance|
|     Jerry|  Male|  3/4/2005|        1:00 PM|138705|   9.34|             true|             Finance|
|     Larry|  Male| 1/24/1998|        4:47 PM|101004|  1.389|             true|     Client Services|
|    Dennis|  Male| 4/18/1987|        1:35 AM|115163| 10.125|            false|               Legal|
|      Ruby|Female| 8/17/1987|        4:20 PM| 65476| 10.012|             true|            

In [14]:
df.count()

1000

## 3. Show the schema of a DataFrame in PySpark.

In [17]:
df.printSchema()

root
 |-- First Name: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Start Date: string (nullable = true)
 |-- Last Login Time: string (nullable = true)
 |-- Salary: string (nullable = true)
 |-- Bonus %: string (nullable = true)
 |-- Senior Management: string (nullable = true)
 |-- Team: string (nullable = true)



In [34]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType, BooleanType, FloatType

In [40]:
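schema = StructType([
    # nullable=False is requested for two fields, but the printSchema()
    # output below still reports nullable = true for every column:
    # Spark appears to treat columns read from CSV as nullable regardless.
    StructField('First Name', StringType(), False),
    StructField('Gender', StringType(), True),
    StructField('Start Date', StringType(), False),
    StructField('Last Login Time', StringType(), True),
    StructField('Salary', IntegerType(), True),
    StructField('Bonus %', FloatType(), True),
    StructField('Senior Management', BooleanType(), True),
    StructField('Team', StringType(), True),
])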

In [41]:
df = spark.read\
    .option('header',True)\
    .schema(schema)\
    .csv('employees.csv')

In [42]:
df.show(10)

+----------+------+----------+---------------+------+-------+-----------------+--------------------+
|First Name|Gender|Start Date|Last Login Time|Salary|Bonus %|Senior Management|                Team|
+----------+------+----------+---------------+------+-------+-----------------+--------------------+
|   Douglas|  Male|  8/6/1993|       12:42 PM| 97308|  6.945|             true|           Marketing|
|    Thomas|  Male| 3/31/1996|        6:53 AM| 61933|   4.17|             true|                NULL|
|     Maria|Female| 4/23/1993|       11:17 AM|130590| 11.858|            false|             Finance|
|     Jerry|  Male|  3/4/2005|        1:00 PM|138705|   9.34|             true|             Finance|
|     Larry|  Male| 1/24/1998|        4:47 PM|101004|  1.389|             true|     Client Services|
|    Dennis|  Male| 4/18/1987|        1:35 AM|115163| 10.125|            false|               Legal|
|      Ruby|Female| 8/17/1987|        4:20 PM| 65476| 10.012|             true|            

In [43]:
df.printSchema()

root
 |-- First Name: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Start Date: string (nullable = true)
 |-- Last Login Time: string (nullable = true)
 |-- Salary: integer (nullable = true)
 |-- Bonus %: float (nullable = true)
 |-- Senior Management: boolean (nullable = true)
 |-- Team: string (nullable = true)



## 4. Select specific columns from a DataFrame in PySpark.

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('spark-test').getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/07/09 15:32:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
from pyspark.sql import Row

In [6]:
df = spark.createDataFrame(
[
    Row(id=1,name='anju',age=20,gender='F',dob='12-20-2000',department='hr'),
    Row(id=2,name='liya',age=15,gender='F',dob='12-20-1990',department='hr'),
    Row(id=3,name='hari',age=30,gender='M',dob='12-20-2005',department='tech'),
    Row(id=4,name='hari',age=30,gender='M',dob='12-20-2005',department='tech'),
    Row(id=5,name='chen',age=15,gender='M',dob='12-20-2005',department='tech'),
    Row(id=6,name='chen',age=15,gender='M',dob='12-20-2005',department='business'),
    Row(id=7,name='chen',age=15,gender='M',dob='12-20-2005',department='business'),
    Row(id=8,name='chen',age=15,gender='M',dob='12-20-2005',department='consulting'),
]
)

In [7]:
df.show()

+---+----+---+------+----------+----------+
| id|name|age|gender|       dob|department|
+---+----+---+------+----------+----------+
|  1|anju| 20|     F|12-20-2000|        hr|
|  2|liya| 15|     F|12-20-1990|        hr|
|  3|hari| 30|     M|12-20-2005|      tech|
|  4|hari| 30|     M|12-20-2005|      tech|
|  5|chen| 15|     M|12-20-2005|      tech|
|  6|chen| 15|     M|12-20-2005|  business|
|  7|chen| 15|     M|12-20-2005|  business|
|  8|chen| 15|     M|12-20-2005|consulting|
+---+----+---+------+----------+----------+



In [8]:
df.select(['name','department']).show()

+----+----------+
|name|department|
+----+----------+
|anju|        hr|
|liya|        hr|
|hari|      tech|
|hari|      tech|
|chen|      tech|
|chen|  business|
|chen|  business|
|chen|consulting|
+----+----------+



## 5. Filter rows based on a condition in PySpark DataFrame.

In [11]:
df.filter(df.department=='hr').show()

+---+----+---+------+----------+----------+
| id|name|age|gender|       dob|department|
+---+----+---+------+----------+----------+
|  1|anju| 20|     F|12-20-2000|        hr|
|  2|liya| 15|     F|12-20-1990|        hr|
+---+----+---+------+----------+----------+



In [12]:
df.where(df.age > 15).show()

+---+----+---+------+----------+----------+
| id|name|age|gender|       dob|department|
+---+----+---+------+----------+----------+
|  1|anju| 20|     F|12-20-2000|        hr|
|  3|hari| 30|     M|12-20-2005|      tech|
|  4|hari| 30|     M|12-20-2005|      tech|
+---+----+---+------+----------+----------+



## 6. Group by a column and perform an aggregation in PySpark.

In [13]:
df.show()

+---+----+---+------+----------+----------+
| id|name|age|gender|       dob|department|
+---+----+---+------+----------+----------+
|  1|anju| 20|     F|12-20-2000|        hr|
|  2|liya| 15|     F|12-20-1990|        hr|
|  3|hari| 30|     M|12-20-2005|      tech|
|  4|hari| 30|     M|12-20-2005|      tech|
|  5|chen| 15|     M|12-20-2005|      tech|
|  6|chen| 15|     M|12-20-2005|  business|
|  7|chen| 15|     M|12-20-2005|  business|
|  8|chen| 15|     M|12-20-2005|consulting|
+---+----+---+------+----------+----------+



In [14]:
df.groupBy(df.department).count().show()

+----------+-----+
|department|count|
+----------+-----+
|        hr|    2|
|      tech|    3|
|  business|    2|
|consulting|    1|
+----------+-----+



## 7. Join two DataFrames in PySpark.

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('pysparkjoin').getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/07/10 15:23:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
from pyspark.sql import Row


In [5]:
df = spark.createDataFrame([
    Row(id=1,name='anju',age=20,gender='F',dob='12-20-2000',department='hr'),
    Row(id=2,name='liya',age=15,gender='F',dob='12-20-1990',department='hr'),
    Row(id=3,name='hari',age=30,gender='M',dob='12-20-2005',department='tech'),
    Row(id=4,name='hari',age=30,gender='M',dob='12-20-2005',department='tech'),
    Row(id=5,name='chen',age=15,gender='M',dob='12-20-2005',department='tech'),
    Row(id=6,name='chen',age=15,gender='M',dob='12-20-2005',department='business'),
    Row(id=7,name='chen',age=15,gender='M',dob='12-20-2005',department='business'),
    Row(id=8,name='chen',age=15,gender='M',dob='12-20-2005',department='consulting'),
])

In [6]:
df.show()

+---+----+---+------+----------+----------+
| id|name|age|gender|       dob|department|
+---+----+---+------+----------+----------+
|  1|anju| 20|     F|12-20-2000|        hr|
|  2|liya| 15|     F|12-20-1990|        hr|
|  3|hari| 30|     M|12-20-2005|      tech|
|  4|hari| 30|     M|12-20-2005|      tech|
|  5|chen| 15|     M|12-20-2005|      tech|
|  6|chen| 15|     M|12-20-2005|  business|
|  7|chen| 15|     M|12-20-2005|  business|
|  8|chen| 15|     M|12-20-2005|consulting|
+---+----+---+------+----------+----------+



In [None]:

8. Rename columns in a PySpark DataFrame.
9. Handle missing or null values in PySpark DataFrame.
10. Create a new column derived from existing columns in PySpark DataFrame.
11. Remove duplicate rows from a PySpark DataFrame.
12. Sort a DataFrame based on one or multiple columns in PySpark.
13. Perform a simple arithmetic operation on DataFrame columns in PySpark.
14. Calculate descriptive statistics for numeric columns in PySpark.
15. Apply user-defined functions (UDF) on PySpark DataFrame.
16. Convert a PySpark DataFrame to Pandas DataFrame.
17. Write a PySpark DataFrame to a CSV file.
18. Cache or persist a PySpark DataFrame for better performance.
19. Handle broadcast joins in PySpark.
20. Perform window functions in PySpark (e.g., rank, row number, etc.).
21. Handle nested structures or arrays in PySpark DataFrame.
22. Handle time-series data in PySpark.
23. Calculate the correlation between columns in a PySpark DataFrame.
24. Create a pivot table in PySpark.
25. Perform cross-tabulation (crosstab) in PySpark.
26. Handle large-scale data using PySpark (memory management, optimizations).
27. Handle skewed data in PySpark.
28. Perform machine learning tasks (e.g., regression, classification) using PySpark MLlib.
29. Optimize PySpark jobs for performance (tuning configurations, parallelism, etc.).
30. Handle different file formats (Parquet, Avro, ORC) in PySpark.
31. Collect values with `collect_list` and `collect_set`.
32. Count rows and distinct values.
33. Read JSON into a DataFrame and process it.
34. Further questions: https://www.datacamp.com/blog/pyspark-interview-questions

In [None]:
1. Data Processing Optimization: How would you optimize a Spark job that processes 1 TB of data daily to reduce execution time and cost?

2. Handling Skewed Data: In a Spark job, one partition is taking significantly longer to process due to skewed data. How would you handle this situation?

3. Streaming Data Pipeline: Describe how you would set up a real-time data pipeline using Spark Structured Streaming to process and analyze clickstream data from a website.

4. Fault Tolerance: How does Spark handle node failures during a job, and what strategies would you use to ensure data processing continues smoothly?

5. Data Join Strategies: You need to join two large datasets in Spark, but you encounter memory issues. What strategies would you employ to handle this?

6. Checkpointing: Explain the role of checkpointing in Spark Streaming and how you would implement it in a real-time application.

7. Stateful Processing: Describe a scenario where you would use stateful processing in Spark Streaming and how you would implement it.

8. Performance Tuning: What are the key parameters you would tune in Spark to improve the performance of a real-time analytics application?

9. Window Operations: How would you use window operations in Spark Streaming to compute rolling averages over a sliding window of events?

10. Handling Late Data: In a Spark Streaming job, how would you handle late-arriving data to ensure accurate results?

11. Integration with Kafka: Describe how you would integrate Spark Streaming with Apache Kafka to process real-time data streams.

12. Backpressure Handling: How does Spark handle backpressure in a streaming application, and what configurations can you use to manage it?

13. Data Deduplication: How would you implement data deduplication in a Spark Streaming job to ensure unique records?

14. Cluster Resource Management: How would you manage cluster resources effectively to run multiple concurrent Spark jobs without contention?

15. Real-Time ETL: Explain how you would design a real-time ETL pipeline using Spark to ingest, transform, and load data into a data warehouse.

16. Handling Large Files: You have a Spark job that needs to process very large files (e.g., 100 GB). How would you optimize the job to handle such files efficiently?

17. Monitoring and Debugging: What tools and techniques would you use to monitor and debug a Spark job running in production?

18. Delta Lake: How would you use Delta Lake with Spark to manage real-time data lakes and ensure data consistency?

19. Partitioning Strategy: How would you design an effective partitioning strategy for a large dataset?

20. Data Serialization: What serialization formats would you use in Spark for real-time data processing, and why?