---

# 1️⃣1️⃣ SparkContext (Very Important)

## What is SparkContext?

SparkContext is the low-level entry point to Spark's core functionality.

It represents:

- The connection to the cluster
- The application's configuration
- Resource coordination with the cluster manager

In older versions (before Spark 2.0), you created it directly:

```python
from pyspark import SparkContext

sc = SparkContext(appName="MyApp")
```

In modern Spark, `SparkSession` creates the SparkContext internally (or reuses one that is already active):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

sc = spark.sparkContext
```

---

## Responsibilities of SparkContext

- Connects to cluster manager
- Requests executors
- Creates RDDs
- Tracks application metadata
- Distributes tasks
- Manages broadcast variables
- Manages accumulators

---

## SparkContext Architecture View

```
Application Code
       ↓
SparkSession
       ↓
SparkContext
       ↓
Cluster Manager
       ↓
Executors
```

---

## Important SparkContext Concepts

### 1️⃣ Broadcast Variables

Used to ship a large read-only value to every executor once, instead of sending it along with each task.

```python
broadcast_var = sc.broadcast(large_lookup_dict)
```

---

### 2️⃣ Accumulators

Used for counters and sums across executors: tasks can only add to an accumulator, and the driver reads the result.

```python
counter = sc.accumulator(0)
```

---

### 3️⃣ Only One SparkContext Per JVM

Only one SparkContext can be active in an application at a time. To create a new one, stop the existing one first with `sc.stop()`.

---

# 🚀 SparkSession vs SparkContext: Detailed Interview Guide

---

# 1️⃣ Quick Summary

| Feature | SparkContext | SparkSession |
|----------|--------------|--------------|
| Introduced In | Spark 1.x and earlier | Spark 2.0 |
| Purpose | Core Spark connection to cluster | Unified entry point to Spark |
| API Type | RDD-based | DataFrame / SQL / Structured Streaming |
| Needed Today? | Yes (internally) | Yes (primary interface) |
| Replaces | N/A | SQLContext, HiveContext, SparkContext (partial) |

---

# 2️⃣ What is SparkContext?

SparkContext is the **original entry point** to Spark (before Spark 2.0).

In older versions of Spark, the main entry point was:

```python
from pyspark import SparkContext

sc = SparkContext("local", "My App")
```

---

## 🧱 Old Spark Architecture (Pre 2.0)

You had to create and manage multiple contexts:

- `SparkContext`
- `SQLContext`
- `HiveContext`
- `StreamingContext`

This made application development complex.

---

### 🔹 What SparkContext Did

- Connected to the cluster
- Coordinated resources
- Managed executors
- Created RDDs
- Handled RDD operations
- Provided low-level distributed processing
- Scheduled tasks

---

## 🔹 SparkContext Responsibilities

- Connects to Cluster Manager
- Requests executors
- Creates RDDs
- Distributes tasks
- Manages broadcast variables
- Manages accumulators

---

## 🔹 Example (Old Style)

```python
from pyspark import SparkContext

sc = SparkContext(appName="MyApp")

rdd = sc.textFile("data.txt")
rdd.count()
```

---

### 🔹 Important Notes

- RDD was the main abstraction
- No structured DataFrame API (initially)
- SQL support required separate `SQLContext`
- Hive support required `HiveContext`
- Streaming required `StreamingContext`
- Multiple contexts had to be created manually

---

## 🔹 Important Facts

- Only **one active SparkContext per JVM**
- If SparkContext stops → no more jobs can run and the application ends
- It is the core object behind every other Spark entry point

---

# 3️⃣ What is SparkSession?

SparkSession was introduced in Spark 2.0.

It is a **unified entry point** for:

- Spark SQL
- DataFrame API
- Structured Streaming
- Hive support

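It builds on, or supersedes, the older contexts:

- `SparkContext`: wrapped, and still accessible as `spark.sparkContext`
- `SQLContext`: superseded; its functionality is available directly on the session
- `HiveContext`: superseded; Hive support is enabled with `.enableHiveSupport()` on the builder


When you create a SparkSession:

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
```

It automatically creates a SparkContext internally (or reuses the one that is already active).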

You can access it like this:

```python
sc = spark.sparkContext
print(sc.appName)
```

---

## 🔹 Example (Modern Way)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

df = spark.read.csv("data.csv")
df.show()
```

---

## 🔹 Why Import from `pyspark.sql`?

Because `SparkSession` belongs to the Spark SQL module, which powers:

- DataFrames
- Spark SQL
- Catalyst Optimizer
- Tungsten Execution Engine

---

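# 🏗 Why SparkSession Was Introduced

To unify the separate contexts into a single entry point.

Instead of managing:

```
SparkContext
SQLContext
HiveContext
```

Now we use:

```
SparkSession
```

`StreamingContext` (the older DStream API) was not merged into SparkSession and is still created separately; SparkSession covers Structured Streaming instead.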

---

# 4️⃣ Relationship Between SparkSession and SparkContext

Very important for interviews 👇

```
SparkSession
     |
     └── SparkContext
```

SparkSession internally creates SparkContext.

You can access it like this:

```python
sc = spark.sparkContext
```

So:

> SparkSession is a wrapper around SparkContext.

---

# 5️⃣ Internal Architecture View

```
Application Code
       ↓
SparkSession
       ↓
SparkContext
       ↓
Cluster Manager
       ↓
Executors
```

---

# 6️⃣ Why SparkSession Was Introduced?

Before Spark 2.0, we had:

- SparkContext
- SQLContext
- HiveContext

Too many contexts.

SparkSession unified everything into one object.

So instead of:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
```

Now we just use:

```python
spark = SparkSession.builder.getOrCreate()
```

---

# 7️⃣ When Do You Use SparkContext Directly?

Rare cases:

- RDD-based operations
- Broadcast variables
- Accumulators
- Low-level distributed logic

Example:

```python
broadcast_var = spark.sparkContext.broadcast([1,2,3])
```

---

# 8️⃣ Interview-Ready Explanation

If the interviewer asks:

### ❓ What is the difference between SparkSession and SparkContext?

Answer:

> SparkContext is the core connection to the cluster and is used mainly for RDD operations. SparkSession is the unified entry point introduced in Spark 2.0 that wraps SparkContext and provides APIs for DataFrames, SQL, and Structured Streaming.

---

# 9️⃣ Practical Rule

In modern Spark:

✅ Always create SparkSession  
❌ Do not manually create SparkContext  

SparkSession will handle it internally.

---

# 🔟 Common Interview Trap

Question:

Can we create multiple SparkContexts?

Answer:

❌ No. Only one active SparkContext per JVM.

But:

You can create multiple SparkSessions that share the same SparkContext.

---

# 🎯 Final Comparison

| Aspect | SparkContext | SparkSession |
|--------|--------------|--------------|
| Level | Low-level | High-level |
| API | RDD | DataFrame/SQL |
| Introduced | Spark 1.x and earlier | Spark 2.0 |
| Used Today | Internally | Primary interface |

---

# 🚀 One-Line Memory Trick

SparkContext = Engine  
SparkSession = Dashboard + Engine

---

# 🧠 Advanced Follow-Up (If Asked)

The interviewer may ask:

- What happens if SparkContext crashes?
- Can SparkSession exist without SparkContext?
- How does SparkSession manage Hive?
- What is getOrCreate() doing internally?



# 🎯 Interview-Ready Answer

**Why did SparkSession replace SparkContext as the entry point?**

SparkSession was introduced in Spark 2.0 as a unified entry point that wraps SparkContext and supersedes SQLContext and HiveContext. It simplifies Spark application development and supports optimized structured APIs like DataFrames and Datasets. StreamingContext (DStreams) is not part of it and remains a separate object; SparkSession covers Structured Streaming instead.

---

# ⚡ Example: Accessing SparkContext from SparkSession

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Demo App") \
    .getOrCreate()

# Access SparkContext
sc = spark.sparkContext

print(sc)
```

---

# 🚀 Key Takeaways

- SparkContext is low-level and RDD-based.
- SparkSession is high-level and structured API based.
- SparkSession internally manages SparkContext.
- Modern Spark development uses SparkSession.
- Always use SparkSession in production applications.

```python
# Default SparkSession, created automatically by Databricks
spark
```

```python
# Default SparkContext, also created automatically by Databricks
sc
```

# How to Create SparkSession and SparkContext Manually

```python
# Create a SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark Fundamentals").getOrCreate()

# Same builder with explicit settings. getOrCreate() returns the session that
# is already active if there is one, so in a Databricks notebook spark2 is the
# existing session rather than a brand-new one.
spark2 = SparkSession.builder.appName("Spark Fundamentals") \
  .config("spark.sql.warehouse.dir", "file:///databricks/driver/spark-warehouse") \
  .config("spark.sql.shuffle.partitions", "4") \
  .getOrCreate()
```

```python
spark2
```

# Understanding SparkSession `.config()` in Spark

---

## Code Example

```python
spark2 = SparkSession.builder \
  .appName("Spark Fundamentals") \
  .config("spark.sql.warehouse.dir", "file:///databricks/driver/spark-warehouse") \
  .config("spark.sql.shuffle.partitions", "4") \
  .getOrCreate()
```

---

# 1️⃣ What is `.config()`?

`.config()` is used to set **Spark configuration properties** while creating a SparkSession.

It allows you to:

- Override default Spark settings
- Tune performance
- Configure storage paths
- Set execution parameters

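Internally, these configurations are passed along this chain:

```
SparkConf → SparkContext → Executors
```

Settings under `spark.sql.*` are additionally tracked in the session's own SQL configuration, which is what `spark.conf` reads and writes.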

---

# 2️⃣ What is `spark.sql.warehouse.dir`?

```
.config("spark.sql.warehouse.dir", "file:///databricks/driver/spark-warehouse")
```

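## 🔹 What It Does

This sets the **default warehouse directory** where Spark stores the data files of:

- Managed tables
- Hive managed tables (when Hive support is enabled)
- Databases created without an explicit location, including the default database

Table metadata (table names, schemas) lives in the metastore, not in this directory.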

When you run:

```sql
CREATE TABLE test_table (...)
```

Spark stores the table data in the warehouse directory.

---

## 🔹 From Where Is This Path Taken?

In Databricks:

```
file:///databricks/driver/spark-warehouse
```

### Breakdown:

- `file://` → Local file system
- `/databricks/driver/` → Driver node's local disk
- `spark-warehouse` → Default warehouse folder
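The same breakdown can be reproduced mechanically. The sketch below uses only Python's standard `urllib.parse` (not a Spark API) to split warehouse paths quoted in these notes into scheme, authority, and path. It only illustrates how such URIs are structured and does not show how Spark itself resolves them.

```python
from urllib.parse import urlparse

# Warehouse paths quoted in these notes; the scheme selects the file system.
warehouse_paths = [
    "file:///databricks/driver/spark-warehouse",
    "dbfs:/user/hive/warehouse",
    "s3://bucket/warehouse/",
]

parts = {}
for path in warehouse_paths:
    parsed = urlparse(path)
    # scheme = file system, netloc = host or bucket (may be empty), path = directory
    parts[path] = (parsed.scheme, parsed.netloc, parsed.path)
    print(f"scheme={parsed.scheme!r} authority={parsed.netloc!r} path={parsed.path!r}")
```

A `file` scheme with an empty authority means a disk local to one machine, which is why such a path is a poor fit for data that executors on other nodes need to read.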

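### Important:

This is a **Databricks-specific path on the driver's local disk**.

If running locally (non-Databricks), the default is a `spark-warehouse` directory inside the directory the application was started from:

```
$PWD/spark-warehouse
```

`/user/hive/warehouse` is the conventional Hive warehouse location, which is why it appears in Hive and Databricks (`dbfs:/user/hive/warehouse`) setups.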

---

# 3️⃣ What is `spark.sql.shuffle.partitions`?

```
.config("spark.sql.shuffle.partitions", "4")
```

## 🔹 What It Controls

This defines the **number of partitions created during shuffle operations** in DataFrame and SQL queries.

Shuffle happens during:

- groupBy()
- join()
- distinct()
- orderBy()
- repartition() (the setting applies when no explicit partition count is passed)

---

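## 🔹 Default Value

Default = **200 partitions**

That means that when you do:

```python
df.groupBy("id").count()
```

Spark plans the shuffle with 200 partitions by default. With Adaptive Query Execution enabled (the default since Spark 3.2), small shuffle partitions can be coalesced at runtime, so the number you actually observe may be lower.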

---

## 🔹 Why Set It to 4?

With small datasets (as in learning or development):

- 200 partitions is too many
- Creates unnecessary small tasks
- Slows down performance

So we reduce it to:

```
4 partitions
```

This makes Spark create only 4 shuffle tasks.
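The cost of too many partitions on small data shows up in rough arithmetic. The sketch below assumes a hypothetical 10 MiB of shuffle data spread evenly across partitions; real data is rarely this uniform, so treat the numbers as an order-of-magnitude guide.

```python
def per_partition_kib(shuffle_bytes: int, num_partitions: int) -> float:
    """Average partition size in KiB, assuming an even spread of data."""
    return shuffle_bytes / num_partitions / 1024

shuffle_bytes = 10 * 1024 * 1024  # hypothetical 10 MiB of shuffle data

sizes_kib = {}
for num_partitions in (200, 4):
    sizes_kib[num_partitions] = per_partition_kib(shuffle_bytes, num_partitions)
    print(f"{num_partitions:>3} partitions -> {sizes_kib[num_partitions]:.1f} KiB per partition")
```

At 200 partitions each task handles about 51 KiB, so scheduling overhead dominates the actual work; at 4 partitions each task handles about 2.5 MiB.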

---

# 4️⃣ Important Interview Concept

### What is Shuffle?

Shuffle is the process of:

- Redistributing data across executors
- Based on key (for join/groupBy)

It is:

- Expensive
- Disk + network heavy
- Performance critical

That's why tuning `spark.sql.shuffle.partitions` is important.
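The idea of redistributing rows by key can be modeled in plain Python. This is a simplified sketch, not Spark's implementation: `zlib.crc32` stands in for Spark's internal hash function, and the executor names and rows are hypothetical.

```python
import zlib
from collections import defaultdict

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index with a stable hash (stand-in for Spark's partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical (key, value) rows as they sit on two executors before the shuffle.
executor_rows = {
    "executor-1": [("apple", 1), ("banana", 2), ("apple", 3)],
    "executor-2": [("banana", 4), ("cherry", 5), ("apple", 6)],
}

num_partitions = 4  # plays the role of spark.sql.shuffle.partitions

# Shuffle: every row moves to the partition chosen by its key's hash.
shuffled = defaultdict(list)
for rows in executor_rows.values():
    for key, value in rows:
        shuffled[partition_for(key, num_partitions)].append((key, value))

# All rows for a key now sit in one partition, so a per-key aggregate
# needs no further data movement.
totals = {}
for rows in shuffled.values():
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value

print(dict(shuffled))
print(totals)
```

Every row may have to leave the executor it started on, which is where the disk and network cost described above comes from.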

---

# 5️⃣ How to Check These Configs?

After creating SparkSession:

```python
spark.conf.get("spark.sql.warehouse.dir")
spark.conf.get("spark.sql.shuffle.partitions")
```

Or list all configs:

```python
spark.sparkContext.getConf().getAll()
```

---

# 6️⃣ Summary

| Config | Purpose |
|--------|----------|
| spark.sql.warehouse.dir | Location to store managed tables |
| spark.sql.shuffle.partitions | Number of partitions during shuffle |
| .config() | Used to set Spark runtime properties |

---

# 🚀 Real Production Note

In production environments:

Warehouse path usually points to:

- HDFS
- S3
- ADLS
- DBFS (Databricks File System)

Example:

```
dbfs:/user/hive/warehouse
s3://bucket/warehouse/
abfss://container@storage.dfs.core.windows.net/warehouse/
```

Not local driver storage.

---

# 🎯 Final Understanding

`.config()` customizes Spark behavior when the session is built.

- Warehouse config → Storage location
- Shuffle config → Performance tuning

Both are very important for Data Engineering interviews.

```python
spark
```