# Lab 1: Setting Up and Exploring the PySpark Environment

## Tasks

1. **Setup and Explore PySpark Environment**
    - Install and configure PySpark.
    - Verify the installation.
    - Configure the environment variables for Java and Spark.

2. **Initialize PySpark**
    - Initialize a `SparkSession`.
    - Access the `SparkContext`.

3. **Create and Explore RDDs**
    - Create RDDs from sample telecom data.
    - Perform basic RDD transformations and actions:
      - Map Transformation
      - Filter Transformation
      - FlatMap Transformation
      - Distinct Transformation
      - Union Transformation
      - Sample Transformation
      - Collect Action
      - Count Action
      - Take Action

4. **Case Study: Identification of Gold Plan Customers**
    - Create RDDs from updated telecom and user data.
    - Perform join operations to combine call data with user data.
    - Filter for Gold plan users.
    - Filter for outgoing calls.
    - Identify Gold users with outgoing calls longer than 10 minutes in December 2024.

5. **Cleanup**
    - Stop the `SparkContext`.

## 1. Installing and Configuring PySpark

### Step 1: Install PySpark
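PySpark can be installed via pip, which is the simplest way to set it up in Python environments. The first cell below installs a JDK and downloads a Spark distribution (needed on hosted notebooks such as Google Colab); the second cell installs the Python packages.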

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz

!tar xzf spark-3.5.3-bin-hadoop3.tgz

In [None]:
# Pin PySpark to the same version as the Spark distribution downloaded above
!pip install -q pyspark==3.5.3
!pip install -q findspark



### Step 2: Verify Installation
Ensure PySpark is installed by checking its version.

In [None]:
import pyspark
print("PySpark Version:", pyspark.__version__)

PySpark Version: 3.5.3


### Step 3: Environment Configuration
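PySpark requires Java for execution. Spark 3.5 runs on Java 8, 11, or 17; this lab uses OpenJDK 8.
If you encounter configuration issues, set the `JAVA_HOME` and `SPARK_HOME` environment variables explicitly. The first cell shows the general form; the second uses the paths from the install step above.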

In [None]:
import os
os.environ["JAVA_HOME"] = "/path/to/java"  # Replace with your Java path
os.environ["SPARK_HOME"] = "/path/to/spark"  # Replace with your Spark path

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.3-bin-hadoop3"
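
Before starting Spark, it can help to confirm that both variables point at real directories. The helper below is a minimal sketch using only the standard library; `check_spark_env` is a hypothetical name introduced here, not part of PySpark.

```python
import os

def check_spark_env(names=("JAVA_HOME", "SPARK_HOME")):
    """Return {variable: status}, where status is 'ok', 'unset', or 'missing directory'."""
    report = {}
    for name in names:
        path = os.environ.get(name)
        if not path:
            report[name] = "unset"
        elif not os.path.isdir(path):
            report[name] = "missing directory"
        else:
            report[name] = "ok"
    return report

# Print one status line per variable.
for name, status in check_spark_env().items():
    print(f"{name}: {status}")
```

A status other than `ok` usually means the path in the cell above needs to be adjusted for your machine.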

### Option: Setup for Google Colab

On Google Colab, the setup steps above reduce to the following commands:

```
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz
!tar xzf spark-3.5.3-bin-hadoop3.tgz
```

```
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.3-bin-hadoop3"
```




### Installation Notes

- **Java:** PySpark requires a Java Development Kit (JDK). The setup code uses `apt-get` (available on Linux systems such as Google Colab) to install OpenJDK 8. Ensure Java is correctly installed and configured on your system.
- **Spark:** The setup code downloads a pre-built Spark distribution from Apache's website. Adjust the URL if you need a different version.
- **PySpark and findspark:** The `pip` commands install the PySpark Python library and the `findspark` package. `findspark` simplifies the process of finding the Spark installation within your Python environment.

## 2. Exploring the PySpark Interface

### Step 1: Initialize a SparkSession
The SparkSession is the entry point to PySpark. It manages configurations and resources.


In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Exploration") \
    .getOrCreate()

print("SparkSession Created")

SparkSession Created


### Step 2: Access SparkContext
The SparkContext allows interaction with the underlying cluster and execution engine.

In [None]:
sc = spark.sparkContext
print("SparkContext Initialized")
print("Application Name:", sc.appName)

SparkContext Initialized
Application Name: PySpark Exploration


## 3. Running Basic Operations


### PySpark RDDs

### What are RDDs?

Resilient Distributed Datasets (RDDs) are the fundamental data structure in Spark.  They represent a collection of elements partitioned across a cluster of machines.  RDDs are immutable, meaning once created, their contents cannot be changed.  Instead of modifying an RDD, you create a new RDD based on transformations applied to the original.  This immutability allows Spark to efficiently manage and optimize data processing.

### Key Characteristics of RDDs:

* **Immutability:** RDDs are read-only. Transformations create new RDDs.
* **Fault Tolerance:** RDDs are resilient to failures. If a node fails, Spark can reconstruct the lost data from its lineage (the sequence of transformations that led to the RDD).
* **Partitioning:** RDDs are divided into partitions, which are distributed across the cluster. This allows for parallel processing.
* **Lazy Evaluation:** Transformations are not executed immediately. Instead, they are recorded as a directed acyclic graph (DAG) of operations.  Actions trigger the execution of the entire DAG.
* **Persistence (Caching):**  You can persist an RDD in memory or on disk to speed up subsequent operations that require the same data.
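
Lazy evaluation can be illustrated without a cluster. Python's built-in `map` and `filter` are also lazy, so the sketch below uses them as an analogy only: Spark records a DAG and distributes the work, whereas these iterators run on one machine. The `executed` list and the `double` function are illustrative names.

```python
executed = []  # records every time the mapping function actually runs

def double(x):
    executed.append(x)
    return x * 2

# Building the pipeline runs nothing yet, much like chaining RDD transformations.
pipeline = filter(lambda v: v > 4, map(double, [1, 2, 3, 4, 5]))
ran_before = len(executed)

# Consuming the pipeline plays the role of an action such as collect().
result = list(pipeline)
ran_after = len(executed)

print(ran_before, ran_after, result)  # 0 5 [6, 8, 10]
```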

### Creating RDDs:

There are two primary ways to create RDDs:

1. **From a file:**
   ```python
   rdd = sc.textFile("path/to/file.txt")
   ```

2. **From a Python collection:**
   ```python
   data = [1, 2, 3, 4, 5]
   rdd = sc.parallelize(data)
   ```


### RDD Transformations:

Transformations create new RDDs from existing ones.  Examples include:

* **`map(func)`:** Applies a function to each element.
* **`filter(func)`:** Filters elements based on a condition.
* **`flatMap(func)`:** Applies a function that returns multiple elements for each input element.
* **`distinct()`:** Removes duplicate elements.
* **`union(otherRDD)`:** Combines two RDDs.
* **`intersection(otherRDD)`:** Returns the common elements of two RDDs.
* **`subtract(otherRDD)`:** Returns elements in the first RDD that are not in the second.
* **`join(otherRDD)`:** Performs an inner join of two RDDs based on a key.
* **`cogroup(otherRDD)`:** Groups elements from two RDDs based on a key.
* **`reduceByKey(func)`:** Reduces values by key.
* **`sortByKey()`:** Sorts elements by key.
* **`groupByKey()`:** Groups elements by key.
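
The key-based transformations operate on RDDs of `(key, value)` pairs. As a rough single-machine model (plain Python, not Spark's distributed implementation), the sketch below shows what `reduceByKey` and `groupByKey` compute. The function names and the sample pairs are illustrative.

```python
from operator import add

def reduce_by_key_model(pairs, func):
    """Plain-Python model of rdd.reduceByKey(func): fold values that share a key."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals

def group_by_key_model(pairs):
    """Plain-Python model of rdd.groupByKey(): collect the values for each key."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]  # illustrative data
print(reduce_by_key_model(pairs, add))  # {'a': 4, 'b': 6, 'c': 5}
print(group_by_key_model(pairs))        # {'a': [1, 3], 'b': [2, 4], 'c': [5]}
```

In Spark the corresponding calls would be `sc.parallelize(pairs).reduceByKey(add)` and `sc.parallelize(pairs).groupByKey()`. For aggregations, `reduceByKey` is usually preferred because it combines values within each partition before shuffling data.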

### RDD Actions:

Actions trigger the execution of transformations and return results to the driver program.  Examples include:

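* **`collect()`:** Returns all elements to the driver as a Python list; use it only when the result fits in driver memory.
* **`count()`:** Returns the number of elements.
* **`first()`:** Returns the first element.
* **`take(n)`:** Returns the first `n` elements.
* **`reduce(func)`:** Reduces elements to a single value using a commutative and associative function.
* **`saveAsTextFile("path")`:** Saves the RDD as text, writing one part file per partition into the directory `path`.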
* **`countByKey()`:** Returns counts of elements by key.
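
The function passed to `reduce` should be commutative and associative, because each partition is reduced separately before the partial results are combined. The sketch below models this in plain Python; `partitioned_reduce` is a hypothetical helper, and its even slicing is only an assumption about how data might be split.

```python
from functools import reduce
from operator import add, sub

def partitioned_reduce(data, func, num_partitions):
    """Model of a distributed reduce: fold each partition, then fold the partial results."""
    size = -(-len(data) // num_partitions)  # ceiling division
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(func, part) for part in partitions]
    return reduce(func, partials)

data = [1, 2, 3, 4, 5, 6, 7, 8]
for n in (1, 2, 4):
    # Addition gives 36 for every partitioning; subtraction gives -34, 8, and 2.
    print(n, partitioned_reduce(data, add, n), partitioned_reduce(data, sub, n))
```

Addition is safe because the grouping of operands does not matter; subtraction is not, so its result depends on how the data happens to be partitioned.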



### Step 1: Create an RDD (Resilient Distributed Dataset)
An RDD is the fundamental distributed data structure in PySpark.

In [None]:
# Sample telecom data: call records as (user_id, date, call_type, duration_minutes)
call_data = [
    ("user1", "2024-12-19", "IN", 5),
    ("user2", "2024-12-19", "OUT", 15),
    ("user3", "2024-12-19", "IN", 7),
    ("user1", "2024-12-20", "OUT", 20),
    ("user2", "2024-12-20", "IN", 10),
    ("user3", "2024-12-20", "OUT", 30),
    ("user4", "2024-12-20", "IN", 12),
    ("user5", "2024-12-19", "OUT", 25)
]

In [None]:
# Create RDDs
call_rdd = sc.parallelize(call_data)
call_rdd

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:289

# 1. Basic RDD Transformations and Actions


In [None]:
print("Sample Call Data:", call_rdd.take(3))  # Display first 3 rows

Sample Call Data: [('user1', '2024-12-19', 'IN', 5), ('user2', '2024-12-19', 'OUT', 15), ('user3', '2024-12-19', 'IN', 7)]


# Map Transformation: Extracting call durations


In [None]:
call_durations = call_rdd.map(lambda x: x[3])
print("Call Durations:", call_durations.collect())

Call Durations: [5, 15, 7, 20, 10, 30, 12, 25]


In [None]:
# Example 1: Extract call types
call_types = call_rdd.map(lambda x: x[2])
print("Call Types:", call_types.collect())

Call Types: ['IN', 'OUT', 'IN', 'OUT', 'IN', 'OUT', 'IN', 'OUT']


In [None]:
# Example 2: Extract user IDs
user_ids = call_rdd.map(lambda x: x[0])
print("User IDs:", user_ids.collect())

User IDs: ['user1', 'user2', 'user3', 'user1', 'user2', 'user3', 'user4', 'user5']


In [None]:
# Example 3: Add a constant value to call durations
incremented_durations = call_rdd.map(lambda x: (x[0], x[1], x[2], x[3] + 5))
print("Incremented Durations:", incremented_durations.collect())

Incremented Durations: [('user1', '2024-12-19', 'IN', 10), ('user2', '2024-12-19', 'OUT', 20), ('user3', '2024-12-19', 'IN', 12), ('user1', '2024-12-20', 'OUT', 25), ('user2', '2024-12-20', 'IN', 15), ('user3', '2024-12-20', 'OUT', 35), ('user4', '2024-12-20', 'IN', 17), ('user5', '2024-12-19', 'OUT', 30)]
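
Lambdas that index tuples by position become hard to read as the logic grows. One option, sketched below, is a named function that can be checked locally before it is passed to Spark. `to_seconds` is a hypothetical helper, and it assumes the duration field is in minutes.

```python
def to_seconds(record):
    """Convert the duration field (assumed to be minutes) of a call record to seconds."""
    user_id, date, call_type, minutes = record
    return (user_id, date, call_type, minutes * 60)

sample = [("user1", "2024-12-19", "IN", 5), ("user2", "2024-12-19", "OUT", 15)]

# Local check with the built-in map; in the notebook the same function
# would be passed to Spark as call_rdd.map(to_seconds).
converted = list(map(to_seconds, sample))
print(converted)
# [('user1', '2024-12-19', 'IN', 300), ('user2', '2024-12-19', 'OUT', 900)]
```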


# Filter Transformation: Calls greater than 10 minutes


In [None]:
long_calls = call_rdd.filter(lambda x: x[3] > 10)
print("Long Calls:", long_calls.collect())

Long Calls: [('user2', '2024-12-19', 'OUT', 15), ('user1', '2024-12-20', 'OUT', 20), ('user3', '2024-12-20', 'OUT', 30), ('user4', '2024-12-20', 'IN', 12), ('user5', '2024-12-19', 'OUT', 25)]


In [None]:
# Example 1: Filter incoming calls
incoming_calls = call_rdd.filter(lambda x: x[2] == "IN")
print("Incoming Calls:", incoming_calls.collect())

Incoming Calls: [('user1', '2024-12-19', 'IN', 5), ('user3', '2024-12-19', 'IN', 7), ('user2', '2024-12-20', 'IN', 10), ('user4', '2024-12-20', 'IN', 12)]


In [None]:
# Example 2: Filter outgoing calls
outgoing_calls = call_rdd.filter(lambda x: x[2] == "OUT")
print("Outgoing Calls:", outgoing_calls.collect())

Outgoing Calls: [('user2', '2024-12-19', 'OUT', 15), ('user1', '2024-12-20', 'OUT', 20), ('user3', '2024-12-20', 'OUT', 30), ('user5', '2024-12-19', 'OUT', 25)]


In [None]:
# Example 3: Filter calls made on a specific date
date_specific_calls = call_rdd.filter(lambda x: x[1] == "2024-12-20")
print("Calls on 2024-12-20:", date_specific_calls.collect())

Calls on 2024-12-20: [('user1', '2024-12-20', 'OUT', 20), ('user2', '2024-12-20', 'IN', 10), ('user3', '2024-12-20', 'OUT', 30), ('user4', '2024-12-20', 'IN', 12)]
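
Several conditions can be combined in a single predicate instead of chaining filters. The following is a minimal sketch; `is_long_outgoing` and its `min_minutes` parameter are hypothetical names, and the records are a subset of the sample data.

```python
def is_long_outgoing(record, min_minutes=10):
    """True for outgoing calls strictly longer than min_minutes."""
    return record[2] == "OUT" and record[3] > min_minutes

sample = [
    ("user1", "2024-12-19", "IN", 5),
    ("user2", "2024-12-19", "OUT", 15),
    ("user4", "2024-12-20", "IN", 12),
    ("user5", "2024-12-19", "OUT", 25),
]

# Local check with the built-in filter; in the notebook this would be
# call_rdd.filter(is_long_outgoing).
kept = list(filter(is_long_outgoing, sample))
print(kept)
# [('user2', '2024-12-19', 'OUT', 15), ('user5', '2024-12-19', 'OUT', 25)]
```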


# FlatMap Transformation: Split user IDs into individual characters

In [None]:
split_user_ids = call_rdd.flatMap(lambda x: list(x[0]))
print("Split User IDs:", split_user_ids.collect())

Split User IDs: ['u', 's', 'e', 'r', '1', 'u', 's', 'e', 'r', '2', 'u', 's', 'e', 'r', '3', 'u', 's', 'e', 'r', '1', 'u', 's', 'e', 'r', '2', 'u', 's', 'e', 'r', '3', 'u', 's', 'e', 'r', '4', 'u', 's', 'e', 'r', '5']


In [None]:
# Example 1: Duplicate each record
duplicated_records = call_rdd.flatMap(lambda x: [x, x])
print("Duplicated Records:", duplicated_records.collect())

Duplicated Records: [('user1', '2024-12-19', 'IN', 5), ('user1', '2024-12-19', 'IN', 5), ('user2', '2024-12-19', 'OUT', 15), ('user2', '2024-12-19', 'OUT', 15), ('user3', '2024-12-19', 'IN', 7), ('user3', '2024-12-19', 'IN', 7), ('user1', '2024-12-20', 'OUT', 20), ('user1', '2024-12-20', 'OUT', 20), ('user2', '2024-12-20', 'IN', 10), ('user2', '2024-12-20', 'IN', 10), ('user3', '2024-12-20', 'OUT', 30), ('user3', '2024-12-20', 'OUT', 30), ('user4', '2024-12-20', 'IN', 12), ('user4', '2024-12-20', 'IN', 12), ('user5', '2024-12-19', 'OUT', 25), ('user5', '2024-12-19', 'OUT', 25)]


In [None]:
# Example 2: Split call types into characters
split_call_types = call_rdd.flatMap(lambda x: list(x[2]))
print("Split Call Types:", split_call_types.collect())

Split Call Types: ['I', 'N', 'O', 'U', 'T', 'I', 'N', 'O', 'U', 'T', 'I', 'N', 'O', 'U', 'T', 'I', 'N', 'O', 'U', 'T']


In [None]:
# Example 3: Generate ranges from call durations
ranges = call_rdd.flatMap(lambda x: range(1, x[3] + 1))
print("Generated Ranges:", ranges.collect())

Generated Ranges: [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
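
The difference between `map` and `flatMap` is easiest to see side by side: `map` produces exactly one output per input, while `flatMap` expects an iterable per input and flattens the results. The two functions below are plain-Python models for illustration, not Spark's implementation.

```python
from itertools import chain

def map_model(func, items):
    """Model of rdd.map: exactly one output per input."""
    return [func(item) for item in items]

def flat_map_model(func, items):
    """Model of rdd.flatMap: each input yields an iterable, and the results are flattened."""
    return list(chain.from_iterable(func(item) for item in items))

ids = ["user1", "user2"]
print(map_model(list, ids))
# [['u', 's', 'e', 'r', '1'], ['u', 's', 'e', 'r', '2']]
print(flat_map_model(list, ids))
# ['u', 's', 'e', 'r', '1', 'u', 's', 'e', 'r', '2']
```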


# Distinct Transformation: Remove duplicate user IDs

In [None]:
unique_user_ids = user_ids.distinct()
print("Unique User IDs:", unique_user_ids.collect())

Unique User IDs: ['user1', 'user2', 'user4', 'user5', 'user3']


In [None]:
# Example 1: Get distinct call types
distinct_call_types = call_types.distinct()
print("Distinct Call Types:", distinct_call_types.collect())

Distinct Call Types: ['OUT', 'IN']


In [None]:
# Example 2: Combine call types and durations for distinct pairs
distinct_pairs = call_rdd.map(lambda x: (x[2], x[3])).distinct()
print("Distinct Call Type-Duration Pairs:", distinct_pairs.collect())

Distinct Call Type-Duration Pairs: [('IN', 5), ('IN', 7), ('OUT', 20), ('OUT', 30), ('OUT', 15), ('IN', 10), ('IN', 12), ('OUT', 25)]


# Union Transformation: Combine RDDs


In [None]:
additional_data = [
    ("user6", "2024-12-21", "IN", 8),
    ("user7", "2024-12-21", "OUT", 14)
]
additional_rdd = sc.parallelize(additional_data)
combined_rdd = call_rdd.union(additional_rdd)
print("Combined RDD:", combined_rdd.collect())

Combined RDD: [('user1', '2024-12-19', 'IN', 5), ('user2', '2024-12-19', 'OUT', 15), ('user3', '2024-12-19', 'IN', 7), ('user1', '2024-12-20', 'OUT', 20), ('user2', '2024-12-20', 'IN', 10), ('user3', '2024-12-20', 'OUT', 30), ('user4', '2024-12-20', 'IN', 12), ('user5', '2024-12-19', 'OUT', 25), ('user6', '2024-12-21', 'IN', 8), ('user7', '2024-12-21', 'OUT', 14)]


In [None]:
# Example 1: Union with an empty RDD
empty_rdd = sc.parallelize([])
union_with_empty = call_rdd.union(empty_rdd)
print("Union with Empty RDD:", union_with_empty.collect())

Union with Empty RDD: [('user1', '2024-12-19', 'IN', 5), ('user2', '2024-12-19', 'OUT', 15), ('user3', '2024-12-19', 'IN', 7), ('user1', '2024-12-20', 'OUT', 20), ('user2', '2024-12-20', 'IN', 10), ('user3', '2024-12-20', 'OUT', 30), ('user4', '2024-12-20', 'IN', 12), ('user5', '2024-12-19', 'OUT', 25)]


In [None]:
# Example 2: Union of RDD with itself
duplicate_union = call_rdd.union(call_rdd)
print("Union with Itself:", duplicate_union.collect())

Union with Itself: [('user1', '2024-12-19', 'IN', 5), ('user2', '2024-12-19', 'OUT', 15), ('user3', '2024-12-19', 'IN', 7), ('user1', '2024-12-20', 'OUT', 20), ('user2', '2024-12-20', 'IN', 10), ('user3', '2024-12-20', 'OUT', 30), ('user4', '2024-12-20', 'IN', 12), ('user5', '2024-12-19', 'OUT', 25), ('user1', '2024-12-19', 'IN', 5), ('user2', '2024-12-19', 'OUT', 15), ('user3', '2024-12-19', 'IN', 7), ('user1', '2024-12-20', 'OUT', 20), ('user2', '2024-12-20', 'IN', 10), ('user3', '2024-12-20', 'OUT', 30), ('user4', '2024-12-20', 'IN', 12), ('user5', '2024-12-19', 'OUT', 25)]


# Sample Transformation: Randomly sample records


In [None]:
sampled_records = call_rdd.sample(False, 0.5, seed=42)
print("Sampled Records:", sampled_records.collect())

Sampled Records: [('user1', '2024-12-19', 'IN', 5), ('user2', '2024-12-19', 'OUT', 15), ('user1', '2024-12-20', 'OUT', 20)]


In [None]:
# Example 1: Sample with replacement
sampled_with_replacement = call_rdd.sample(True, 0.3, seed=42)
print("Sampled with Replacement:", sampled_with_replacement.collect())

Sampled with Replacement: [('user4', '2024-12-20', 'IN', 12), ('user4', '2024-12-20', 'IN', 12), ('user4', '2024-12-20', 'IN', 12)]


In [None]:
# Example 2: Sample with a higher fraction
sampled_high_fraction = call_rdd.sample(False, 0.8, seed=42)
print("Sampled with High Fraction:", sampled_high_fraction.collect())

Sampled with High Fraction: [('user1', '2024-12-19', 'IN', 5), ('user2', '2024-12-19', 'OUT', 15), ('user3', '2024-12-19', 'IN', 7), ('user1', '2024-12-20', 'OUT', 20), ('user2', '2024-12-20', 'IN', 10), ('user3', '2024-12-20', 'OUT', 30)]
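
Note that `fraction` is an expected proportion, not an exact count. Without replacement, each element is kept independently with probability `fraction`, which is why the outputs above contain 3 and 6 of the 8 records rather than exactly 4 and 6.4. The sketch below models this in plain Python. It uses Python's random generator rather than Spark's, so it will not reproduce the samples shown above, and `bernoulli_sample` is a hypothetical name.

```python
import random

def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction` (no replacement)."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]

records = list(range(8))  # stand-in for the 8 call records
for seed in (1, 2, 3):
    picked = bernoulli_sample(records, 0.5, seed)
    print(seed, len(picked), picked)  # sample size can differ from seed to seed
```

Fixing the seed makes a sample repeatable, which is useful when comparing runs.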


# Collect Action: Fetch all records


In [None]:
all_records = call_rdd.collect()
print("All Records:", all_records)

All Records: [('user1', '2024-12-19', 'IN', 5), ('user2', '2024-12-19', 'OUT', 15), ('user3', '2024-12-19', 'IN', 7), ('user1', '2024-12-20', 'OUT', 20), ('user2', '2024-12-20', 'IN', 10), ('user3', '2024-12-20', 'OUT', 30), ('user4', '2024-12-20', 'IN', 12), ('user5', '2024-12-19', 'OUT', 25)]


# Count Action: Count the number of records


In [None]:
total_count = call_rdd.count()
print("Total Count of Records:", total_count)

Total Count of Records: 8


In [None]:
# Example 1: Count distinct user IDs
total_users = unique_user_ids.count()
print("Total Unique Users:", total_users)

Total Unique Users: 5


In [None]:
# Example 2: Count incoming calls
incoming_call_count = incoming_calls.count()
print("Incoming Call Count:", incoming_call_count)

Incoming Call Count: 4


In [None]:
# Example 3: Count outgoing calls
outgoing_call_count = outgoing_calls.count()
print("Outgoing Call Count:", outgoing_call_count)

Outgoing Call Count: 4


# Take Action: Fetch the first few records

In [None]:
first_two_records = call_rdd.take(2)
print("First Two Records:", first_two_records)

First Two Records: [('user1', '2024-12-19', 'IN', 5), ('user2', '2024-12-19', 'OUT', 15)]


In [None]:
# Example 1: Take the first three records
first_three_records = call_rdd.take(3)
print("First Three Records:", first_three_records)

First Three Records: [('user1', '2024-12-19', 'IN', 5), ('user2', '2024-12-19', 'OUT', 15), ('user3', '2024-12-19', 'IN', 7)]


In [None]:
# Example 2: Take the first four distinct user IDs
first_four_users = unique_user_ids.take(4)
print("First Four Unique User IDs:", first_four_users)

First Four Unique User IDs: ['user1', 'user2', 'user4', 'user5']


## Case Study: Identification of Gold Plan Customers

Identify users on the "Gold" plan who made outgoing calls longer than 10 minutes in December 2024. The result can help in understanding the usage patterns of a specific customer segment.







In [None]:

# Updated sample telecom data: call records (with an added plan type field) and user records
call_data = [
    ("user1", "2024-12-19", "IN", 5, "A"),  # Added plan type
    ("user2", "2024-12-19", "OUT", 15, "B"),
    ("user3", "2024-12-19", "IN", 7, "A"),
    ("user1", "2024-12-20", "OUT", 20, "A"),
    ("user2", "2024-12-20", "IN", 10, "B"),
    ("user3", "2024-12-20", "OUT", 30, "C"),
    ("user4", "2024-12-20", "IN", 12, "A"),
    ("user5", "2024-12-19", "OUT", 25, "B")
]

user_data = [
    ("user1", "Silver", 25),
    ("user2", "Gold", 40),
    ("user3", "Silver", 30),
    ("user4", "Platinum", 55),
    ("user5", "Gold", 35)
]


In [None]:
# Create RDDs
call_rdd = sc.parallelize(call_data)
user_rdd = sc.parallelize(user_data)

In [None]:
# Key both RDDs by user ID, then join call records with user records.
# Each result has the shape (user_id, (call_record, user_record)).
joined_rdd = call_rdd.map(lambda x: (x[0], x)).join(user_rdd.map(lambda x: (x[0], x)))
print("Joined RDD:", joined_rdd.collect())

Joined RDD: [('user1', (('user1', '2024-12-19', 'IN', 5, 'A'), ('user1', 'Silver', 25))), ('user1', (('user1', '2024-12-20', 'OUT', 20, 'A'), ('user1', 'Silver', 25))), ('user4', (('user4', '2024-12-20', 'IN', 12, 'A'), ('user4', 'Platinum', 55))), ('user5', (('user5', '2024-12-19', 'OUT', 25, 'B'), ('user5', 'Gold', 35))), ('user2', (('user2', '2024-12-19', 'OUT', 15, 'B'), ('user2', 'Gold', 40))), ('user2', (('user2', '2024-12-20', 'IN', 10, 'B'), ('user2', 'Gold', 40))), ('user3', (('user3', '2024-12-19', 'IN', 7, 'A'), ('user3', 'Silver', 30))), ('user3', (('user3', '2024-12-20', 'OUT', 30, 'C'), ('user3', 'Silver', 30)))]
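
The nested shape of the join output above, `(key, (left_value, right_value))`, can be modeled in plain Python. This is a rough sketch of inner-join semantics, not Spark's implementation; `inner_join_model`, the sample pairs, and the unmatched `user9` record are illustrative.

```python
def inner_join_model(left, right):
    """Plain-Python model of left.join(right) for lists of (key, value) pairs."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    # Emit one row per matching (left, right) combination; unmatched keys are dropped.
    return [(key, (lval, rval))
            for key, lval in left
            for rval in right_by_key.get(key, [])]

calls_kv = [("user1", ("IN", 5)), ("user2", ("OUT", 15)), ("user9", ("OUT", 3))]
users_kv = [("user1", "Silver"), ("user2", "Gold")]

joined = inner_join_model(calls_kv, users_kv)
print(joined)
# [('user1', (('IN', 5), 'Silver')), ('user2', (('OUT', 15), 'Gold'))]

# Flatten the nested value into one tuple per row.
flat = [(key, *call, plan) for key, (call, plan) in joined]
print(flat)
# [('user1', 'IN', 5, 'Silver'), ('user2', 'OUT', 15, 'Gold')]
```

In Spark, the same flattening could be applied to the joined RDD with a `map` over its `(key, (call, user))` elements.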


In [None]:
# Follow-up steps

# 1. Filter for Gold plan users
gold_users_rdd = user_rdd.filter(lambda x: x[1] == "Gold")

# 2. Extract user IDs from the gold users RDD
gold_user_ids = gold_users_rdd.map(lambda x: x[0])

In [None]:
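# 3. Filter for outgoing calls
outgoing_calls_rdd = call_rdd.filter(lambda x: x[2] == "OUT")

# 4. Filter for Gold users with more than 10-minute outgoing calls in December 2024.
# Collect the Gold IDs on the driver first: an RDD cannot be referenced inside another RDD's transformation.
gold_ids = set(gold_user_ids.collect())
targeted_users_rdd = outgoing_calls_rdd.filter(
    lambda x: x[3] > 10 and x[1].startswith("2024-12") and x[0] in gold_ids)
print("Targeted Gold Users:", targeted_users_rdd.collect())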
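
An alternative reading of the task is that a user qualifies when their total outgoing minutes in December 2024 exceed 10, rather than any single call. The plain-Python sketch below works through that reading on a subset of the case-study data as a cross-check. `gold_outgoing_totals` is a hypothetical helper, and the data here omits the fifth call field for brevity.

```python
from collections import defaultdict

calls = [  # (user_id, date, call_type, minutes); subset of the case-study records
    ("user1", "2024-12-20", "OUT", 20),
    ("user2", "2024-12-19", "OUT", 15),
    ("user2", "2024-12-20", "IN", 10),
    ("user5", "2024-12-19", "OUT", 25),
]
plans = {"user1": "Silver", "user2": "Gold", "user5": "Gold"}

def gold_outgoing_totals(calls, plans, month="2024-12"):
    """Total outgoing minutes per Gold user for calls whose date starts with `month`."""
    totals = defaultdict(int)
    for user_id, date, call_type, minutes in calls:
        if plans.get(user_id) == "Gold" and call_type == "OUT" and date.startswith(month):
            totals[user_id] += minutes
    return dict(totals)

totals = gold_outgoing_totals(calls, plans)
heavy_users = sorted(user for user, minutes in totals.items() if minutes > 10)
print(totals, heavy_users)  # {'user2': 15, 'user5': 25} ['user2', 'user5']
```

With RDDs, the same totals could be computed by mapping the filtered calls to `(user_id, minutes)` pairs and applying `reduceByKey`.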


## Cleanup
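Always stop the SparkContext once all tasks are complete to release its resources. Calling `spark.stop()` on the SparkSession has the same effect, since it stops the underlying SparkContext.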

In [None]:
sc.stop()