## Spark Interview Question

**Question:**  
Given a DataFrame containing customer reports, how would you extract 10-digit mobile numbers from the text using Databricks?

### Approach:
- **Method 1:** Traditional Spark Approach
- **Method 2:** AI Approach


----------------------------------------------------------------------------------------------------------
**Input Data Example:**

| customer_id | report_text                                                      |
|-------------|------------------------------------------------------------------|
| 1           | "Customer called from 9876543210 regarding billing issue."       |
| 2           | "Contact number: 9123456789. Requested account update."          |
| 3           | "No mobile number provided in this report."                      |
| 4           | "Reach me at 9988776655 for further details."                    |
| 5           | "Alternate contact: 9001122334, main: 8002233445."               |

# Create Input DF

In [0]:
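# Each row must be a one-element tuple (note the trailing comma);
# a bare string in parentheses is just a string, and schema inference fails on it.
data = [
    ("i have a issue with my billing. My mob number is 7864321909",),
    ("Billing issue identified in 2389110328",),
    ("number 4563210099 have internet connection error",),
    ("No Mobile Number provided",),
    ("My main number is 8002233445. Alternate number is 9001122334",),
]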

df = spark.createDataFrame(data, ["Customer report"])
display(df)

| Customer report                                              |
|--------------------------------------------------------------|
| i have a issue with my billing. My mob number is 7864321909  |
| Billing issue identified in 2389110328                       |
| number 4563210099 have internet connection error             |
| No Mobile Number provided                                    |
| My main number is 8002233445. Alternate number is 9001122334 |


# Traditional Approach

In [0]:
# Import regex extract function
from pyspark.sql.functions import regexp_extract

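# Extract the first 10-digit number from the text
# \b     - word boundary
# \d{10} - exactly 10 digits
# \b     - word boundary
# regexp_extract returns only the first match, or an empty string if there is none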
df_with_mobile = df.withColumn("Mobile_Number_Regex", regexp_extract(df["Customer report"], r"\b\d{10}\b", 0))

# Display output
display(df_with_mobile)

| Customer report                                              | Mobile_Number_Regex |
|--------------------------------------------------------------|---------------------|
| i have a issue with my billing. My mob number is 7864321909  | 7864321909          |
| Billing issue identified in 2389110328                       | 2389110328          |
| number 4563210099 have internet connection error             | 4563210099          |
| No Mobile Number provided                                    |                     |
| My main number is 8002233445. Alternate number is 9001122334 | 8002233445          |
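
The last row above contains two numbers, but only the first is returned. The pattern's behaviour can be checked outside Spark with a minimal sketch in plain Python. It assumes Python's `re` module treats `\b\d{10}\b` the same way Spark's Java regex engine does for ASCII text like this. `extract_first` and `extract_all` are hypothetical helper names used only for illustration.

```python
import re
from typing import List

# Same pattern as in the Spark cell: exactly 10 digits between word boundaries.
MOBILE_PATTERN = re.compile(r"\b\d{10}\b")

def extract_first(text: str) -> str:
    """Mimic regexp_extract(..., 0): the first match, or "" when nothing matches."""
    match = MOBILE_PATTERN.search(text)
    return match.group(0) if match else ""

def extract_all(text: str) -> List[str]:
    """Return every 10-digit match in order of appearance."""
    return MOBILE_PATTERN.findall(text)

reports = [
    "My main number is 8002233445. Alternate number is 9001122334",
    "No Mobile Number provided",
    "Ticket reference 12345678901",  # 11 digits: the word boundaries reject it
]

for report in reports:
    print(repr(extract_first(report)), extract_all(report))
```

Inside Spark, the SQL function `regexp_extract_all` (available from Spark 3.1) plays the role of `extract_all` and returns an array of matches instead of a single string.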


# Leverage AI Functions

In [0]:
# Create a temporary view over the DataFrame
df.createOrReplaceTempView("customer_reports")

# SQL query that extracts the mobile number with the ai_extract AI function
df_with_mobile = spark.sql("""
SELECT
  *,
  ai_extract(`Customer report`, array("mobile number")) AS Mobile_Number_AI
FROM
  customer_reports
""")

# Display output
display(df_with_mobile)

| Customer report                                              | Mobile_Number_AI |
|--------------------------------------------------------------|------------------|
| i have a issue with my billing. My mob number is 7864321909  | List(7864321909) |
| Billing issue identified in 2389110328                       | List(2389110328) |
| number 4563210099 have internet connection error             | List(4563210099) |
| No Mobile Number provided                                    | List(null)       |
| My main number is 8002233445. Alternate number is 9001122334 | List(8002233445) |


# Can We Include the Alternate Number?

In [0]:
df_with_mobile = spark.sql("""
SELECT
  *,
  ai_extract(`Customer report`, array("mobile number","alternate number")) AS Mobile_Number_AI
FROM
  customer_reports
""")
display(df_with_mobile)

| Customer report                                              | Mobile_Number_AI             |
|--------------------------------------------------------------|------------------------------|
| i have a issue with my billing. My mob number is 7864321909  | List(7864321909, null)       |
| Billing issue identified in 2389110328                       | List(2389110328, null)       |
| number 4563210099 have internet connection error             | List(4563210099, null)       |
| No Mobile Number provided                                    | List(null, null)             |
| My main number is 8002233445. Alternate number is 9001122334 | List(8002233445, 9001122334) |
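
Because `ai_extract` is model-based, it may be worth confirming that each returned value really is a 10-digit number before using it downstream. The following is a minimal sketch in plain Python, not part of the Databricks API. It assumes the extracted fields have already been collected as Python strings or `None`, and `clean_mobile` is a hypothetical helper name.

```python
import re
from typing import Optional

# Exactly ten ASCII digits and nothing else.
TEN_DIGITS = re.compile(r"[0-9]{10}")

def clean_mobile(value: Optional[str]) -> Optional[str]:
    """Return the trimmed value if it is exactly 10 digits, otherwise None."""
    if value is None:
        return None
    candidate = value.strip()
    return candidate if TEN_DIGITS.fullmatch(candidate) else None

# Values shaped like the AI output above: (mobile number, alternate number)
extracted = [("8002233445", "9001122334"), ("7864321909", None), (None, None)]

for mobile, alternate in extracted:
    print(clean_mobile(mobile), clean_mobile(alternate))
```

A check like this pairs the flexibility of the AI approach (it can tell the main number from the alternate one) with the strictness of the regex approach.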
