# Lab 2: Spark SQL and DataFrames

## Tasks

1. **Markdown Cells:**
    - Introduction to Spark SQL and DataFrames.
    - Explanation of key concepts in Spark SQL and DataFrames.
    - Description of the PySpark lab and its objectives.
    - Explanation of various SQL operations and functions.

2. **Setup and Configuration:**
    - Install and configure PySpark.
    - Initialize a SparkSession.

3. **Define Schema:**
    - Define the schema for the synthetic telecom dataset using `StructType` and `StructField`.

4. **Create Synthetic Dataset:**
    - Create a synthetic dataset with sample telecom data.

5. **Create DataFrame:**
    - Create a DataFrame from the synthetic dataset using the defined schema.
    - Display the DataFrame.

6. **Register DataFrame as SQL Temporary View:**
    - Register the DataFrame as a temporary SQL view named "Telecom".

7. **SQL Queries:**
    - Perform various SQL queries on the "Telecom" view, including:
      - Selecting specific columns.
      - Filtering rows based on conditions.
      - Using aggregate functions (e.g., `AVG`, `SUM`, `COUNT`).
      - Ordering results.
      - Combining conditions.
      - Performing string operations.
      - Using SQL subqueries.
      - Using `IN` and `DISTINCT`.
      - Using `GROUP BY` with `HAVING`.
      - Using `CASE` for conditional logic.

8. **Example SQL Queries:**
    - Find all customers who churned.
    - Calculate the average monthly charges for each plan.
    - Find the customer with the highest data usage.
    - List customers with tenure greater than 1 year and monthly charges less than $60.
    - Count the number of customers in each city.


## Introduction to Spark SQL and DataFrames

Spark SQL allows querying structured data using SQL-like syntax, while DataFrames provide a distributed data abstraction with named columns.
This lab will focus on performing SQL queries and DataFrame operations for structured data processing.

## PySpark Lab: Spark SQL and DataFrames

### Spark SQL

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.  It allows querying data in various formats (like Parquet, JSON, CSV, Hive tables) using SQL or a DataFrame API in Python, Scala, Java, and R.

#### Key Concepts

- **Catalyst Optimizer:** A query optimizer that analyzes and rewrites queries to generate efficient execution plans, which is crucial for performance. It combines rule-based rewrites with cost-based optimization.

- **DataFrame API:** A set of functions (similar to those in pandas) for manipulating structured data. It is often preferred for complex transformations, or when you want fine-grained control over data processing.

- **SQL Queries:** Traditional SQL queries can be executed against Spark DataFrames. This is familiar to users comfortable with SQL and allows integration with existing SQL-based ETL workflows.

- **Data Sources:** Spark SQL supports reading from and writing to various data sources, including Hive tables, JSON files, CSV files, Parquet files, ORC files, and JDBC data sources.


### DataFrames

DataFrames are conceptually equivalent to tables in a relational database or to pandas DataFrames in Python. They organize data into named columns, similar to a spreadsheet or SQL table. Unlike those, Spark DataFrames are distributed across a cluster.

#### Key Features

- **Immutability:** DataFrames in Spark are immutable. Operations on a DataFrame create a *new* DataFrame with the changes, leaving the original unchanged.

- **Lazy Evaluation:** Transformations on DataFrames are not executed immediately. Spark records them in a plan, which the Catalyst Optimizer optimizes before an action (such as `show()` or `count()`) triggers execution as a directed acyclic graph (DAG) of stages.

- **Schema:** DataFrames have a defined schema describing the data type of each column. This helps Spark process data efficiently, because it knows the structure of the data.

- **Schema Inference:** Spark can infer the schema of a DataFrame from data sources like JSON or CSV files if you do not provide an explicit schema.

- **Distributed Computing:** DataFrames are distributed across the Spark cluster, enabling parallel processing of large datasets.

- **Operations:** A variety of operations are available for manipulating DataFrames, including filtering, aggregation (group by, sum, count, average, min, max), joins, and transformations.
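
Immutability and lazy evaluation can be illustrated with a toy model in plain Python. This is only a loose analogy: the `LazyPipeline` class below is hypothetical, has no optimizer, and is not how Spark is implemented. It shows two things: each transformation returns a new object, and no work happens until an "action" (`collect()`) is called.

```python
class LazyPipeline:
    """Toy model: records transformations and runs them only on collect()."""

    def __init__(self, rows, steps=()):
        self._rows = rows            # source data, never modified
        self._steps = tuple(steps)   # recorded transformations (the "plan")

    def filter(self, predicate):
        # Returns a NEW pipeline; nothing is computed yet.
        return LazyPipeline(self._rows, self._steps + (("filter", predicate),))

    def select(self, func):
        # Also returns a new pipeline with one more recorded step.
        return LazyPipeline(self._rows, self._steps + (("map", func),))

    def collect(self):
        # The "action": only now are the recorded steps executed, in order.
        result = iter(self._rows)
        for kind, fn in self._steps:
            result = filter(fn, result) if kind == "filter" else map(fn, result)
        return list(result)


calls = []  # records every time the predicate actually runs


def heavy_user(row):
    calls.append(row[0])
    return row[1] > 50


base = LazyPipeline([("C002", 50.0), ("C005", 60.0), ("C008", 100.0)])
heavy = base.filter(heavy_user).select(lambda row: row[0])

evaluated_before_action = len(calls)  # defining the pipeline ran nothing
heavy_ids = heavy.collect()           # the predicate runs only here
all_rows = base.collect()             # the original pipeline is unchanged
```

After the last line, `evaluated_before_action` is `0`, `heavy_ids` holds the two customers above 50 GB, and `base` still yields all three rows.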


### Interoperability

Spark SQL and DataFrames are highly interoperable. A SQL query run with `spark.sql(...)` against a registered view returns a DataFrame, so its result can be processed further with the DataFrame API. Converting between DataFrames and RDDs is also possible, but RDD operations bypass the Catalyst Optimizer and are often less efficient than native DataFrame operations.


### Performance Considerations

- **Data Partitioning:** Proper data partitioning is crucial for performance. Partitioning data on the keys used in joins and aggregations reduces the amount of data shuffled between executors.

- **Catalyst Optimizer:** Understanding how the Catalyst Optimizer works is valuable. For complex queries, inspecting the execution plan (for example with `explain()`) can help identify areas for improvement.

- **Storage Format:** Choosing an appropriate storage format (e.g., the columnar Parquet format) can significantly reduce storage and processing costs.

In [None]:
# Install and configure PySpark
!pip install pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

# Initialize SparkSession

spark = SparkSession \
    .builder \
    .appName("Spark SQL Telecom Examples") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()



## Define schema for synthetic telecom dataset

In [None]:
schema = StructType([
    StructField("CustomerID", StringType(), True),
    StructField("Plan", StringType(), True),
    StructField("City", StringType(), True),
    StructField("DataUsageGB", FloatType(), True),
    StructField("MonthlyCharges", FloatType(), True),
    StructField("TenureMonths", IntegerType(), True),
    StructField("Churn", StringType(), True)
])

## Create synthetic dataset

In [None]:
telecom_data = [
    ("C001", "Basic", "New York", 2.5, 20.0, 12, "No"),
    ("C002", "Premium", "Los Angeles", 50.0, 100.0, 24, "Yes"),
    ("C003", "Basic", "Chicago", 3.0, 25.0, 8, "No"),
    ("C004", "Standard", "Houston", 10.0, 50.0, 15, "No"),
    ("C005", "Premium", "Phoenix", 60.0, 120.0, 30, "Yes"),
    ("C006", "Standard", "Seattle", 20.0, 60.0, 18, "No"),
    ("C007", "Basic", "Denver", 1.0, 15.0, 5, "No"),
    ("C008", "Premium", "San Francisco", 100.0, 200.0, 36, "Yes")
]
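
When an explicit schema is passed, `createDataFrame` verifies each value against the declared type by default, so a Python `int` such as `5` in a `FloatType` column raises an error rather than being converted. A minimal pre-check in plain Python, assuming the field order of the schema above; `find_type_mismatches` and the `C009` row are hypothetical helpers for illustration, not part of PySpark or the lab data:

```python
# Python types that correspond to the lab schema, in column order.
expected_types = [
    ("CustomerID", str), ("Plan", str), ("City", str),
    ("DataUsageGB", float), ("MonthlyCharges", float),
    ("TenureMonths", int), ("Churn", str),
]


def find_type_mismatches(rows, fields):
    """Return (row_index, field_name, actual_type_name) for each mismatch."""
    problems = []
    for i, row in enumerate(rows):
        if len(row) != len(fields):
            problems.append((i, "length", len(row)))
            continue
        for (name, py_type), value in zip(fields, row):
            # None is allowed because every field in the schema is nullable.
            if value is not None and not isinstance(value, py_type):
                problems.append((i, name, type(value).__name__))
    return problems


good_rows = [("C001", "Basic", "New York", 2.5, 20.0, 12, "No")]
# Hypothetical row: DataUsageGB is written as the int 5 instead of 5.0.
bad_rows = [("C009", "Basic", "Boston", 5, 20.0, 12, "No")]

good_report = find_type_mismatches(good_rows, expected_types)
bad_report = find_type_mismatches(bad_rows, expected_types)
```

Writing numeric literals with a decimal point (`5.0`), as the dataset does, avoids this class of error.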

## Create DataFrame

In [None]:
df = spark.createDataFrame(telecom_data, schema=schema)

# Display the DataFrame
df.show()

+----------+--------+-------------+-----------+--------------+------------+-----+
|CustomerID|    Plan|         City|DataUsageGB|MonthlyCharges|TenureMonths|Churn|
+----------+--------+-------------+-----------+--------------+------------+-----+
|      C001|   Basic|     New York|        2.5|          20.0|          12|   No|
|      C002| Premium|  Los Angeles|       50.0|         100.0|          24|  Yes|
|      C003|   Basic|      Chicago|        3.0|          25.0|           8|   No|
|      C004|Standard|      Houston|       10.0|          50.0|          15|   No|
|      C005| Premium|      Phoenix|       60.0|         120.0|          30|  Yes|
|      C006|Standard|      Seattle|       20.0|          60.0|          18|   No|
|      C007|   Basic|       Denver|        1.0|          15.0|           5|   No|
|      C008| Premium|San Francisco|      100.0|         200.0|          36|  Yes|
+----------+--------+-------------+-----------+--------------+------------+-----+



## Register DataFrame as SQL temporary view

In [None]:
df.createOrReplaceTempView("Telecom")

## Selecting specific columns

In [None]:
# Example 1
spark.sql("SELECT CustomerID, Plan, MonthlyCharges FROM Telecom").show()

+----------+--------+--------------+
|CustomerID|    Plan|MonthlyCharges|
+----------+--------+--------------+
|      C001|   Basic|          20.0|
|      C002| Premium|         100.0|
|      C003|   Basic|          25.0|
|      C004|Standard|          50.0|
|      C005| Premium|         120.0|
|      C006|Standard|          60.0|
|      C007|   Basic|          15.0|
|      C008| Premium|         200.0|
+----------+--------+--------------+



In [None]:
# Example 2
spark.sql("SELECT Plan, City FROM Telecom").show()

+--------+-------------+
|    Plan|         City|
+--------+-------------+
|   Basic|     New York|
| Premium|  Los Angeles|
|   Basic|      Chicago|
|Standard|      Houston|
| Premium|      Phoenix|
|Standard|      Seattle|
|   Basic|       Denver|
| Premium|San Francisco|
+--------+-------------+



In [None]:
# Example 3
spark.sql("SELECT CustomerID, TenureMonths FROM Telecom").show()

+----------+------------+
|CustomerID|TenureMonths|
+----------+------------+
|      C001|          12|
|      C002|          24|
|      C003|           8|
|      C004|          15|
|      C005|          30|
|      C006|          18|
|      C007|           5|
|      C008|          36|
+----------+------------+



## Filtering rows based on conditions

In [None]:
# Example 1
spark.sql("SELECT * FROM Telecom WHERE Churn = 'Yes'").show()

+----------+-------+-------------+-----------+--------------+------------+-----+
|CustomerID|   Plan|         City|DataUsageGB|MonthlyCharges|TenureMonths|Churn|
+----------+-------+-------------+-----------+--------------+------------+-----+
|      C002|Premium|  Los Angeles|       50.0|         100.0|          24|  Yes|
|      C005|Premium|      Phoenix|       60.0|         120.0|          30|  Yes|
|      C008|Premium|San Francisco|      100.0|         200.0|          36|  Yes|
+----------+-------+-------------+-----------+--------------+------------+-----+



In [None]:
# Example 2
spark.sql("SELECT CustomerID, DataUsageGB FROM Telecom WHERE DataUsageGB > 50").show()

+----------+-----------+
|CustomerID|DataUsageGB|
+----------+-----------+
|      C005|       60.0|
|      C008|      100.0|
+----------+-----------+



In [None]:
# Example 3
spark.sql("SELECT CustomerID, TenureMonths FROM Telecom WHERE TenureMonths < 12").show()

+----------+------------+
|CustomerID|TenureMonths|
+----------+------------+
|      C003|           8|
|      C007|           5|
+----------+------------+



## Aggregate functions

In [None]:
# Example 1
spark.sql("SELECT Plan, AVG(MonthlyCharges) as AvgCharges FROM Telecom GROUP BY Plan").show()

+--------+----------+
|    Plan|AvgCharges|
+--------+----------+
| Premium|     140.0|
|   Basic|      20.0|
|Standard|      55.0|
+--------+----------+



In [None]:
# Example 2
spark.sql("SELECT Plan, SUM(DataUsageGB) as TotalDataUsage FROM Telecom GROUP BY Plan").show()

+--------+--------------+
|    Plan|TotalDataUsage|
+--------+--------------+
| Premium|         210.0|
|   Basic|           6.5|
|Standard|          30.0|
+--------+--------------+



In [None]:
# Example 3
spark.sql("SELECT City, COUNT(CustomerID) as CustomerCount FROM Telecom GROUP BY City").show()

+-------------+-------------+
|         City|CustomerCount|
+-------------+-------------+
|  Los Angeles|            1|
|      Chicago|            1|
|      Houston|            1|
|     New York|            1|
|      Phoenix|            1|
|San Francisco|            1|
|      Seattle|            1|
|       Denver|            1|
+-------------+-------------+
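
To see what `GROUP BY Plan` with `AVG` and `SUM` computes, the same arithmetic can be reproduced in plain Python. This is a sketch for checking the numbers by hand, not Spark code; the rows are the `Plan`, `DataUsageGB`, and `MonthlyCharges` columns of the lab dataset.

```python
from collections import defaultdict

# (Plan, DataUsageGB, MonthlyCharges) taken from the lab dataset.
rows = [
    ("Basic", 2.5, 20.0), ("Premium", 50.0, 100.0), ("Basic", 3.0, 25.0),
    ("Standard", 10.0, 50.0), ("Premium", 60.0, 120.0), ("Standard", 20.0, 60.0),
    ("Basic", 1.0, 15.0), ("Premium", 100.0, 200.0),
]

# GROUP BY Plan: collect the rows that share a plan.
groups = defaultdict(list)
for plan, usage, charges in rows:
    groups[plan].append((usage, charges))

# AVG(MonthlyCharges) and SUM(DataUsageGB) per group.
avg_charges = {plan: sum(c for _, c in vals) / len(vals) for plan, vals in groups.items()}
total_usage = {plan: sum(u for u, _ in vals) for plan, vals in groups.items()}
```

For example, the Basic plan has charges 20.0, 25.0, and 15.0, so its average is 60.0 / 3 = 20.0, matching the query output.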



## Ordering results

In [None]:
# Top 3 customers with the highest monthly charges
spark.sql("SELECT CustomerID, MonthlyCharges FROM Telecom ORDER BY MonthlyCharges DESC LIMIT 3").show()

+----------+--------------+
|CustomerID|MonthlyCharges|
+----------+--------------+
|      C008|         200.0|
|      C005|         120.0|
|      C002|         100.0|
+----------+--------------+



In [None]:
# Three customers with the lowest data usage
spark.sql("SELECT CustomerID, DataUsageGB FROM Telecom ORDER BY DataUsageGB ASC LIMIT 3").show()

+----------+-----------+
|CustomerID|DataUsageGB|
+----------+-----------+
|      C007|        1.0|
|      C001|        2.5|
|      C003|        3.0|
+----------+-----------+



In [None]:
# Cities ordered by number of customers, highest first
spark.sql("SELECT City, COUNT(CustomerID) as CustomerCount FROM Telecom GROUP BY City ORDER BY CustomerCount DESC").show()

+-------------+-------------+
|         City|CustomerCount|
+-------------+-------------+
|  Los Angeles|            1|
|      Chicago|            1|
|      Houston|            1|
|     New York|            1|
|      Phoenix|            1|
|San Francisco|            1|
|      Seattle|            1|
|       Denver|            1|
+-------------+-------------+
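
Every city has exactly one customer here, so `ORDER BY CustomerCount DESC` leaves the order among the tied rows unspecified. One way to make it deterministic is a secondary sort key; the SQL counterpart would be `ORDER BY CustomerCount DESC, City`. A plain-Python sketch of that idea (not Spark code):

```python
from collections import Counter

cities = ["New York", "Los Angeles", "Chicago", "Houston",
          "Phoenix", "Seattle", "Denver", "San Francisco"]

# COUNT(CustomerID) ... GROUP BY City
customer_count = Counter(cities)

# ORDER BY CustomerCount DESC, City ASC: negate the count for descending
# order, then fall back to the city name to break ties.
ranked = sorted(customer_count.items(), key=lambda item: (-item[1], item[0]))
top_three = ranked[:3]
```

With all counts equal to 1, the tie-breaker alone decides the order, so `ranked` is simply alphabetical by city.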



## Combining conditions

In [None]:
# Customers with Premium plan and tenure greater than 24 months
spark.sql("SELECT CustomerID, Plan, TenureMonths FROM Telecom WHERE Plan = 'Premium' AND TenureMonths > 24").show()

+----------+-------+------------+
|CustomerID|   Plan|TenureMonths|
+----------+-------+------------+
|      C005|Premium|          30|
|      C008|Premium|          36|
+----------+-------+------------+



In [None]:
# Customers in New York or Los Angeles
spark.sql("SELECT CustomerID, City FROM Telecom WHERE City IN ('New York', 'Los Angeles')").show()

+----------+-----------+
|CustomerID|       City|
+----------+-----------+
|      C001|   New York|
|      C002|Los Angeles|
+----------+-----------+



In [None]:
# Customers not on the Basic plan
spark.sql("SELECT CustomerID, Plan FROM Telecom WHERE Plan != 'Basic'").show()

+----------+--------+
|CustomerID|    Plan|
+----------+--------+
|      C002| Premium|
|      C004|Standard|
|      C005| Premium|
|      C006|Standard|
|      C008| Premium|
+----------+--------+



## String operations

In [None]:
# Customers whose City starts with 'S'
spark.sql("SELECT CustomerID, City FROM Telecom WHERE City LIKE 'S%'").show()

+----------+-------------+
|CustomerID|         City|
+----------+-------------+
|      C006|      Seattle|
|      C008|San Francisco|
+----------+-------------+



In [None]:
# Extract the first three letters of the City name
spark.sql("SELECT CustomerID, SUBSTRING(City, 1, 3) AS CityPrefix FROM Telecom").show()

+----------+----------+
|CustomerID|CityPrefix|
+----------+----------+
|      C001|       New|
|      C002|       Los|
|      C003|       Chi|
|      C004|       Hou|
|      C005|       Pho|
|      C006|       Sea|
|      C007|       Den|
|      C008|       San|
+----------+----------+



In [None]:
# Customers whose City does not contain a lowercase 'a' (LIKE is case-sensitive, so 'Los Angeles' still matches)
spark.sql("SELECT CustomerID, City FROM Telecom WHERE City NOT LIKE '%a%'").show()

+----------+-----------+
|CustomerID|       City|
+----------+-----------+
|      C001|   New York|
|      C002|Los Angeles|
|      C004|    Houston|
|      C005|    Phoenix|
|      C007|     Denver|
+----------+-----------+
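
"Los Angeles" appears in this result because `LIKE` is case-sensitive: the pattern `'%a%'` looks for a lowercase `a`, and the only "a" in that name is uppercase. A plain-Python sketch of the difference (not Spark code); the case-insensitive variant corresponds to a condition such as `LOWER(City) NOT LIKE '%a%'`:

```python
cities = ["New York", "Los Angeles", "Chicago", "Houston",
          "Phoenix", "Seattle", "Denver", "San Francisco"]

# Case-sensitive check, mirroring City NOT LIKE '%a%'.
no_lowercase_a = [city for city in cities if "a" not in city]

# Case-insensitive check, mirroring LOWER(City) NOT LIKE '%a%'.
no_a_any_case = [city for city in cities if "a" not in city.lower()]
```

The second list drops "Los Angeles", which is probably what a reader expects from "does not contain 'a'".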



## SQL subqueries

In [None]:
# Customers with charges above the average monthly charge
spark.sql("SELECT CustomerID, MonthlyCharges FROM Telecom WHERE MonthlyCharges > (SELECT AVG(MonthlyCharges) FROM Telecom)").show()

+----------+--------------+
|CustomerID|MonthlyCharges|
+----------+--------------+
|      C002|         100.0|
|      C005|         120.0|
|      C008|         200.0|
+----------+--------------+



In [None]:
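# Cities whose total data usage is above the average per-city total
spark.sql("SELECT City FROM (SELECT City, SUM(DataUsageGB) AS TotalData FROM Telecom GROUP BY City) WHERE TotalData > (SELECT AVG(TotalData) FROM (SELECT SUM(DataUsageGB) AS TotalData FROM Telecom GROUP BY City))").show()

+-------------+
|         City|
+-------------+
|  Los Angeles|
|      Phoenix|
|San Francisco|
+-------------+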
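
Two thresholds are easy to confuse in a query like this: the grand total of `DataUsageGB` over the whole table, and the mean of the per-city totals. Only the second matches "above average"; comparing against the first returns no rows, because no single city can exceed the sum of all cities. A plain-Python sketch of both, using the per-city totals from the lab dataset (one customer per city):

```python
# Per-city totals; each city has a single customer in the lab dataset.
usage_by_city = {
    "New York": 2.5, "Los Angeles": 50.0, "Chicago": 3.0, "Houston": 10.0,
    "Phoenix": 60.0, "Seattle": 20.0, "Denver": 1.0, "San Francisco": 100.0,
}

grand_total = sum(usage_by_city.values())
mean_city_total = grand_total / len(usage_by_city)

# Comparing each city against the grand total can never match.
above_grand_total = [c for c, total in usage_by_city.items() if total > grand_total]

# Comparing against the mean per-city total gives the intended answer.
above_mean = [c for c, total in usage_by_city.items() if total > mean_city_total]
```

Here the grand total is 246.5 GB and the mean per-city total is 246.5 / 8 = 30.8125 GB.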



In [None]:
# Premium customers with above-average data usage
spark.sql("SELECT CustomerID, DataUsageGB FROM Telecom WHERE Plan = 'Premium' AND DataUsageGB > (SELECT AVG(DataUsageGB) FROM Telecom)").show()

+----------+-----------+
|CustomerID|DataUsageGB|
+----------+-----------+
|      C002|       50.0|
|      C005|       60.0|
|      C008|      100.0|
+----------+-----------+



## Using IN and DISTINCT

In [None]:
# Customers in specific cities
spark.sql("SELECT DISTINCT CustomerID, City FROM Telecom WHERE City IN ('New York', 'Chicago', 'Phoenix')").show()

+----------+--------+
|CustomerID|    City|
+----------+--------+
|      C001|New York|
|      C003| Chicago|
|      C005| Phoenix|
+----------+--------+



In [None]:
# Plans available in the dataset
spark.sql("SELECT DISTINCT Plan FROM Telecom").show()

+--------+
|    Plan|
+--------+
| Premium|
|   Basic|
|Standard|
+--------+



In [None]:
# Distinct customer churn statuses
spark.sql("SELECT DISTINCT Churn FROM Telecom").show()

+-----+
|Churn|
+-----+
|   No|
|  Yes|
+-----+



## Using GROUP BY with HAVING

In [None]:
# Cities with more than 1 customer (none in this dataset, so the result is empty)
spark.sql("SELECT City, COUNT(CustomerID) as CustomerCount FROM Telecom GROUP BY City HAVING CustomerCount > 1").show()

+----+-------------+
|City|CustomerCount|
+----+-------------+
+----+-------------+



In [None]:
# Plans with average charges above 50
spark.sql("SELECT Plan, AVG(MonthlyCharges) as AvgCharges FROM Telecom GROUP BY Plan HAVING AvgCharges > 50").show()

+--------+----------+
|    Plan|AvgCharges|
+--------+----------+
| Premium|     140.0|
|Standard|      55.0|
+--------+----------+



In [None]:
# Cities with total data usage strictly greater than 100 (San Francisco is exactly 100.0, so the result is empty)
spark.sql("SELECT City, SUM(DataUsageGB) as TotalDataUsage FROM Telecom GROUP BY City HAVING TotalDataUsage > 100").show()

+----+--------------+
|City|TotalDataUsage|
+----+--------------+
+----+--------------+
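
`HAVING` filters groups after aggregation, whereas `WHERE` filters rows before it. The last query returns nothing because the comparison is strict and the largest per-city total is exactly 100.0. A plain-Python sketch of the aggregate-then-filter order (not Spark code):

```python
from collections import defaultdict

# (City, DataUsageGB) from the lab dataset.
rows = [
    ("New York", 2.5), ("Los Angeles", 50.0), ("Chicago", 3.0), ("Houston", 10.0),
    ("Phoenix", 60.0), ("Seattle", 20.0), ("Denver", 1.0), ("San Francisco", 100.0),
]

# Step 1, GROUP BY City with SUM(DataUsageGB).
totals = defaultdict(float)
for city, usage in rows:
    totals[city] += usage

# Step 2, HAVING: filter the aggregated groups, not the original rows.
strictly_over_100 = {city: t for city, t in totals.items() if t > 100}
at_least_100 = {city: t for city, t in totals.items() if t >= 100}
```

Switching the condition to `>= 100` would include San Francisco.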



## Using CASE

In [None]:
# Categorize customers based on MonthlyCharges
spark.sql("SELECT CustomerID, MonthlyCharges, CASE WHEN MonthlyCharges < 50 THEN 'Low' WHEN MonthlyCharges BETWEEN 50 AND 100 THEN 'Medium' ELSE 'High' END AS ChargeCategory FROM Telecom").show()

+----------+--------------+--------------+
|CustomerID|MonthlyCharges|ChargeCategory|
+----------+--------------+--------------+
|      C001|          20.0|           Low|
|      C002|         100.0|        Medium|
|      C003|          25.0|           Low|
|      C004|          50.0|        Medium|
|      C005|         120.0|          High|
|      C006|          60.0|        Medium|
|      C007|          15.0|           Low|
|      C008|         200.0|          High|
+----------+--------------+--------------+



In [None]:
# Categorize tenure into short, medium, and long
spark.sql("SELECT CustomerID, TenureMonths, CASE WHEN TenureMonths < 12 THEN 'Short' WHEN TenureMonths BETWEEN 12 AND 24 THEN 'Medium' ELSE 'Long' END AS TenureCategory FROM Telecom").show()

+----------+------------+--------------+
|CustomerID|TenureMonths|TenureCategory|
+----------+------------+--------------+
|      C001|          12|        Medium|
|      C002|          24|        Medium|
|      C003|           8|         Short|
|      C004|          15|        Medium|
|      C005|          30|          Long|
|      C006|          18|        Medium|
|      C007|           5|         Short|
|      C008|          36|          Long|
+----------+------------+--------------+



In [None]:
# Flag customers based on data usage
spark.sql("SELECT CustomerID, DataUsageGB, CASE WHEN DataUsageGB > 50 THEN 'High Usage' ELSE 'Normal Usage' END AS UsageCategory FROM Telecom").show()

+----------+-----------+-------------+
|CustomerID|DataUsageGB|UsageCategory|
+----------+-----------+-------------+
|      C001|        2.5| Normal Usage|
|      C002|       50.0| Normal Usage|
|      C003|        3.0| Normal Usage|
|      C004|       10.0| Normal Usage|
|      C005|       60.0|   High Usage|
|      C006|       20.0| Normal Usage|
|      C007|        1.0| Normal Usage|
|      C008|      100.0|   High Usage|
+----------+-----------+-------------+
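
Two details of `CASE` are worth spelling out: `BETWEEN` is inclusive at both ends, and a `NULL` value matches no `WHEN` branch, so it falls through to `ELSE`. A plain-Python mirror of the `ChargeCategory` expression, with `None` standing in for `NULL` (a sketch of the semantics, not Spark code; `charge_category` is a hypothetical helper):

```python
def charge_category(monthly_charges):
    """Mirror of the CASE expression used for ChargeCategory."""
    if monthly_charges is None:
        # In SQL, comparisons with NULL are unknown, so no WHEN branch
        # matches and the ELSE value is returned.
        return "High"
    if monthly_charges < 50:
        return "Low"
    if 50 <= monthly_charges <= 100:  # BETWEEN 50 AND 100 includes both bounds
        return "Medium"
    return "High"


boundary_labels = {value: charge_category(value) for value in (49.99, 50.0, 100.0, 100.01)}
null_label = charge_category(None)
```

If labeling missing charges as "High" is not intended, an explicit `WHEN MonthlyCharges IS NULL THEN ...` branch would be needed in the SQL.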








## Example SQL Queries


In [None]:
# Find all customers who churned.
# Expected Result: Customers who have churned ('Yes')
spark.sql("SELECT * FROM Telecom WHERE Churn = 'Yes'").show()

+----------+-------+-------------+-----------+--------------+------------+-----+
|CustomerID|   Plan|         City|DataUsageGB|MonthlyCharges|TenureMonths|Churn|
+----------+-------+-------------+-----------+--------------+------------+-----+
|      C002|Premium|  Los Angeles|       50.0|         100.0|          24|  Yes|
|      C005|Premium|      Phoenix|       60.0|         120.0|          30|  Yes|
|      C008|Premium|San Francisco|      100.0|         200.0|          36|  Yes|
+----------+-------+-------------+-----------+--------------+------------+-----+



In [None]:
# Calculate the average monthly charges for each plan.
spark.sql("SELECT Plan, AVG(MonthlyCharges) AS AverageCharges FROM Telecom GROUP BY Plan").show()

+--------+--------------+
|    Plan|AverageCharges|
+--------+--------------+
| Premium|         140.0|
|   Basic|          20.0|
|Standard|          55.0|
+--------+--------------+



In [None]:
# Find the customer with the highest data usage.
spark.sql("SELECT CustomerID, DataUsageGB FROM Telecom ORDER BY DataUsageGB DESC LIMIT 1").show()

+----------+-----------+
|CustomerID|DataUsageGB|
+----------+-----------+
|      C008|      100.0|
+----------+-----------+



In [None]:
# List customers with tenure greater than 1 year and monthly charges less than $60.
spark.sql("SELECT CustomerID, TenureMonths, MonthlyCharges FROM Telecom WHERE TenureMonths > 12 AND MonthlyCharges < 60").show()

+----------+------------+--------------+
|CustomerID|TenureMonths|MonthlyCharges|
+----------+------------+--------------+
|      C004|          15|          50.0|
+----------+------------+--------------+
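
Only C004 is returned because both comparisons are strict: C001 has a tenure of exactly 12 months and C006 pays exactly $60, so neither qualifies. A plain-Python sketch contrasting the strict and inclusive readings of "greater than 1 year" and "less than $60" (not Spark code):

```python
# (CustomerID, TenureMonths, MonthlyCharges) from the lab dataset.
customers = [
    ("C001", 12, 20.0), ("C002", 24, 100.0), ("C003", 8, 25.0), ("C004", 15, 50.0),
    ("C005", 30, 120.0), ("C006", 18, 60.0), ("C007", 5, 15.0), ("C008", 36, 200.0),
]

# WHERE TenureMonths > 12 AND MonthlyCharges < 60
strict = [cid for cid, tenure, charges in customers if tenure > 12 and charges < 60]

# WHERE TenureMonths >= 12 AND MonthlyCharges <= 60
inclusive = [cid for cid, tenure, charges in customers if tenure >= 12 and charges <= 60]
```

The strict version matches the task wording; the inclusive version is shown only to make the boundary cases visible.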



In [None]:
# Count the number of customers in each city.
spark.sql("SELECT City, COUNT(*) AS CustomerCount FROM Telecom GROUP BY City").show()

+-------------+-------------+
|         City|CustomerCount|
+-------------+-------------+
|  Los Angeles|            1|
|      Chicago|            1|
|      Houston|            1|
|     New York|            1|
|      Phoenix|            1|
|San Francisco|            1|
|      Seattle|            1|
|       Denver|            1|
+-------------+-------------+

