#### **locate()**

- The **locate()** function in PySpark is used to find the **position of a substring** within a **string**.

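- It works like SQL's **INSTR() or POSITION()** functions. Note that **instr(str, substr)** takes its arguments in the opposite order from **locate(substr, str)**.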

- The returned position is **1-based**, **not zero-based**. It returns **0 if substr could not be found in str**.

- With the optional **pos** argument, it locates the **first occurrence** of substr in a string column, searching from position pos onward.

- If the substring occurs **more than once** in a string, locate() returns the **position** of the **first occurrence**.

#### **Syntax**

     locate(substr, str[, pos])

**substr:** the substring to find

**str:** the column where you want to search

**pos (optional):** the position to start searching from (1-based index; defaults to 1)
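
The rules above can be captured in a few lines of plain Python, which is handy for checking expected positions without a Spark session. This is only a sketch of the semantics, not Spark's implementation: `locate_model` is a hypothetical helper, it assumes a non-empty substring, and its handling of `None` and of `pos < 1` is an assumption about how Spark SQL behaves.

```python
from typing import Optional


def locate_model(substr: Optional[str], s: Optional[str], pos: int = 1) -> Optional[int]:
    """Plain-Python model of locate(substr, str, pos) for non-empty substrings."""
    if substr is None or s is None:
        return None  # assumption: a null input yields null
    if pos < 1:
        return 0  # assumption: a start position below 1 never matches
    # str.find() is 0-based and returns -1 on a miss, so adding 1 gives a
    # 1-based position and turns "not found" into 0.
    return s.find(substr, pos - 1) + 1


print(locate_model("data", "Azure data engineer (ADE)"))        # 7
print(locate_model("data", "GCP engineer"))                     # 0
print(locate_model("data", "data warehouse data storage", 2))   # 16
```

The last call shows that pos only moves the starting point; the result is still counted from the first character of the string.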

In [0]:
from pyspark.sql.functions import substring, concat, lit, col, expr, locate, when

     df1 = df.withColumn("loc", locate(";", col("RecurrencePattern")))
- This **searches** from the **beginning of the string** (i.e., **position 1**), because pos is omitted.

- This is **equivalent** to **locate(";", col("RecurrencePattern"), 1)**.

     df1 = df.withColumn("loc", locate(";", col("RecurrencePattern"), 1)) 
- This explicitly sets the **start position to 1**, which is also the **beginning of the string**.

     locate(";", col("RecurrencePattern"), 5)
- This starts **searching** at the **5th character**. The returned **position** is still counted from the **beginning of the string**.

#### **1) Find Position of Substring**
- Here, the new **position** column holds the **1-based index** of the **first occurrence** of **"data"**, or **0** when it is not found.

In [0]:
from pyspark.sql.functions import locate

# Sample data
data = [("Azure data engineer (ADE)", "suman@gmail.com"),
        ("AWS data engineer (AWS)", "kiranrathod@gmail.com"),
        ("data warehouse", "rameshwaran@gmail.com"),
        ("GCP engineer", "krishnamurthy@gmail.com"),
        ("PySpark engineer", "vishweswarrao@gmail.com")]

columns = ["text", "mail"]

df = spark.createDataFrame(data, columns)
display(df)

text,mail
Azure data engineer (ADE),suman@gmail.com
AWS data engineer (AWS),kiranrathod@gmail.com
data warehouse,rameshwaran@gmail.com
GCP engineer,krishnamurthy@gmail.com
PySpark engineer,vishweswarrao@gmail.com


In [0]:
# Use locate() to find the position of 'data'
df_pos = df.withColumn("position", locate("data", df["text"]))

# Show the DataFrame with the position column
display(df_pos)

text,mail,position
Azure data engineer (ADE),suman@gmail.com,7
AWS data engineer (AWS),kiranrathod@gmail.com,5
data warehouse,rameshwaran@gmail.com,1
GCP engineer,krishnamurthy@gmail.com,0
PySpark engineer,vishweswarrao@gmail.com,0


In [0]:
# Filter rows where 'data' is found in 'text'
df_filtered = df \
    .withColumn("position", locate("data", df["text"])) \
    .filter(col("position") > 0)  # reuse the column computed on the previous line
display(df_filtered)

text,mail,position
Azure data engineer (ADE),suman@gmail.com,7
AWS data engineer (AWS),kiranrathod@gmail.com,5
data warehouse,rameshwaran@gmail.com,1


#### **2) Using locate to Find Substring in a Specific Column**
- This finds the **position** of the **"@"** symbol in the **mail** column.
- The result will be the **index** position where **"@"** appears in **each email string**.

In [0]:
df_email = df.withColumn("position_of_email", locate("@", col("mail")))
display(df_email)

text,mail,position_of_email
Azure data engineer (ADE),suman@gmail.com,6
AWS data engineer (AWS),kiranrathod@gmail.com,12
data warehouse,rameshwaran@gmail.com,12
GCP engineer,krishnamurthy@gmail.com,14
PySpark engineer,vishweswarrao@gmail.com,14
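
A typical use of the "@" position is to split each address into a user name and a domain. The plain-Python sketch below shows the index arithmetic on two of the sample addresses; `split_mail` is a hypothetical helper, and in Spark the same arithmetic would usually be written with substring() around locate().

```python
def split_mail(mail: str) -> tuple:
    """Split an address at the first "@" using a 1-based position, as locate() reports it."""
    at = mail.find("@") + 1          # 1-based position, 0 when "@" is missing
    if at == 0:
        return (mail, "")            # assumption: no "@" means no domain part
    user = mail[: at - 1]            # characters before the "@"
    domain = mail[at:]               # characters after the "@"
    return (user, domain)


for mail in ["suman@gmail.com", "krishnamurthy@gmail.com"]:
    print(mail, "->", split_mail(mail))
```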


#### **3) Finding Position of Substring in a Column with Multiple Occurrences**

- When a **string** contains **multiple occurrences** of the **substring**, locate() returns the **position** of the **first occurrence** only.

In [0]:
from pyspark.sql.functions import locate

# Sample data
data = [("Azure data engineer data world", "suman@gmail.com"),
        ("AWS data engineer data type", "kiranrathod@gmail.com"),
        ("data warehouse data storage", "rameshwaran@gmail.com"),
        ("GCP engineer", "krishnamurthy@gmail.com"),
        ("PySpark engineer", "vishweswarrao@gmail.com")]

columns = ["text", "mail"]

dff = spark.createDataFrame(data, columns)
display(dff)

text,mail
Azure data engineer data world,suman@gmail.com
AWS data engineer data type,kiranrathod@gmail.com
data warehouse data storage,rameshwaran@gmail.com
GCP engineer,krishnamurthy@gmail.com
PySpark engineer,vishweswarrao@gmail.com


In [0]:
df_mltpl = dff.withColumn("position", locate("data", col("text")))
display(df_mltpl)

text,mail,position
Azure data engineer data world,suman@gmail.com,7
AWS data engineer data type,kiranrathod@gmail.com,5
data warehouse data storage,rameshwaran@gmail.com,1
GCP engineer,krishnamurthy@gmail.com,0
PySpark engineer,vishweswarrao@gmail.com,0
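
locate() reports only the first match, but its pos argument makes it possible to reach later ones by searching again one character past the previous hit. The sketch below models that loop in plain Python on the sample strings; `all_positions` is a hypothetical helper, not a PySpark function, and it assumes a non-empty substring.

```python
def all_positions(substr: str, s: str) -> list:
    """Return every 1-based start position of substr in s (overlaps included)."""
    positions = []
    pos = 1                                   # 1-based start position, like locate()'s pos
    while True:
        hit = s.find(substr, pos - 1) + 1     # 1-based hit, 0 when nothing is left
        if hit == 0:
            return positions
        positions.append(hit)
        pos = hit + 1                         # resume one character past the hit


print(all_positions("data", "Azure data engineer data world"))  # [7, 21]
print(all_positions("data", "GCP engineer"))                    # []
```

Under the same logic, the second occurrence alone should be reachable in Spark SQL with a nested call such as `locate('data', text, locate('data', text) + 1)`.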


#### **4) Handling Missing Substrings with locate**

- When the **substring doesn’t exist** in the string, locate() will return **0**.

In [0]:
df_miss = df.withColumn("position", locate("banana", col("mail")))
display(df_miss)

text,mail,position
Azure data engineer (ADE),suman@gmail.com,0
AWS data engineer (AWS),kiranrathod@gmail.com,0
data warehouse,rameshwaran@gmail.com,0
GCP engineer,krishnamurthy@gmail.com,0
PySpark engineer,vishweswarrao@gmail.com,0


#### **5) Case Sensitivity in locate**

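- locate() is **case-sensitive**: searching for **"Data"** does not match **"data"**, so normalize the case first (for example with lower()) when a case-insensitive match is needed.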

In [0]:
# "position" matches lowercase "data"; "pos" searches for "Data", which never occurs in this sample
df_sens = df.withColumn("position", locate("data", col("text"))) \
            .withColumn("pos", locate("Data", col("text")))
display(df_sens)

text,mail,position,pos
Azure data engineer (ADE),suman@gmail.com,7,0
AWS data engineer (AWS),kiranrathod@gmail.com,5,0
data warehouse,rameshwaran@gmail.com,1,0
GCP engineer,krishnamurthy@gmail.com,0,0
PySpark engineer,vishweswarrao@gmail.com,0,0
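
When a case-insensitive search is needed, a common approach is to lower-case both the searched text and the search term before locating; in PySpark that usually means wrapping the column in lower(). The plain-Python sketch below models the idea; `locate_ci` is a hypothetical helper and the sample strings are made up for illustration.

```python
def locate_ci(substr: str, s: str) -> int:
    """1-based position of substr in s, ignoring case; 0 when absent."""
    return s.lower().find(substr.lower()) + 1


print(locate_ci("data", "Azure Data engineer"))   # 7
print(locate_ci("DATA", "data warehouse"))        # 1
print(locate_ci("data", "GCP engineer"))          # 0
```

Lower-casing can change the length of a few non-ASCII characters, so the reported positions are only guaranteed to line up with the original text where that does not happen.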


#### **6) Check If Substring Exists (Using locate with when and col)**
- Here, **data_present** will be **True** if the substring **data** is found and **False** otherwise.

In [0]:
from pyspark.sql.functions import when

# Create a new column "data_present" that returns True if 'data' is found, otherwise False.
df_with_conditional = df \
    .withColumn("location", locate("data", col("text"))) \
    .withColumn("data_present", when(locate("data", col("text")) > 0, True).otherwise(False))
display(df_with_conditional)

text,mail,location,data_present
Azure data engineer (ADE),suman@gmail.com,7,True
AWS data engineer (AWS),kiranrathod@gmail.com,5,True
data warehouse,rameshwaran@gmail.com,1,True
GCP engineer,krishnamurthy@gmail.com,0,False
PySpark engineer,vishweswarrao@gmail.com,0,False


#### **7) Extract a Substring Starting at the Located Position Using locate and substring**

In [0]:
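# Use expr to apply substring and locate together: take 13 characters starting at 'data'.
# Rows without a match get locate() = 0, so substring returns the first 13 characters instead.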
df_pos = df \
    .withColumn("location", locate("data", col("text"))) \
    .withColumn("data_substring", expr("substring(text, locate('data', text), 13)"))
display(df_pos)

text,mail,location,data_substring
Azure data engineer (ADE),suman@gmail.com,7,data engineer
AWS data engineer (AWS),kiranrathod@gmail.com,5,data engineer
data warehouse,rameshwaran@gmail.com,1,data warehous
GCP engineer,krishnamurthy@gmail.com,0,GCP engineer
PySpark engineer,vishweswarrao@gmail.com,0,PySpark engin
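
Because a missing substring gives position 0, the unguarded expression above falls back to the start of the string, as the last two rows show. If those rows should be empty instead, the position can be checked first (in PySpark, for instance, with when()). The sketch below models both variants in plain Python; `substring_model` and `extract_from` are hypothetical helpers, and `substring_model` only covers non-negative positions.

```python
from typing import Optional


def substring_model(s: str, pos: int, length: int) -> str:
    """Model of SQL substring(s, pos, length) for pos >= 0 (pos is 1-based, 0 acts like 1)."""
    start = max(pos - 1, 0)
    return s[start:start + length]


def extract_from(s: str, substr: str, length: int) -> Optional[str]:
    """Return `length` characters starting at substr, or None when substr is absent."""
    pos = s.find(substr) + 1                  # 1-based position, 0 on a miss
    return substring_model(s, pos, length) if pos > 0 else None


print(substring_model("PySpark engineer", 0, 13))             # PySpark engin
print(extract_from("Azure data engineer (ADE)", "data", 13))  # data engineer
print(extract_from("PySpark engineer", "data", 13))           # None
```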


#### **8) Replace Substring After Locating It**

In [0]:
from pyspark.sql.functions import regexp_replace

df_rpl = df \
    .withColumn("location", locate("data", col("text"))) \
    .withColumn("updated_description",  
                when(locate("data", col("text")) > 0, regexp_replace(col("text"), "data", "Data"))
                .otherwise(col("text")))
display(df_rpl)

text,mail,location,updated_description
Azure data engineer (ADE),suman@gmail.com,7,Azure Data engineer (ADE)
AWS data engineer (AWS),kiranrathod@gmail.com,5,AWS Data engineer (AWS)
data warehouse,rameshwaran@gmail.com,1,Data warehouse
GCP engineer,krishnamurthy@gmail.com,0,GCP engineer
PySpark engineer,vishweswarrao@gmail.com,0,PySpark engineer


#### **9) Multiple Substring Search Using locate()**

In [0]:
# Find the position of 'Data' in 'updated_description' and of 'engineer' in 'text'
df_mltpl_str = df_rpl \
       .withColumn("data_position", locate("Data", col("updated_description"))) \
       .withColumn("text_position", locate("engineer", col("text"))) \
       .select("text", "text_position", "mail", "updated_description", "data_position")

display(df_mltpl_str)

text,text_position,mail,updated_description,data_position
Azure data engineer (ADE),12,suman@gmail.com,Azure Data engineer (ADE),7
AWS data engineer (AWS),10,kiranrathod@gmail.com,AWS Data engineer (AWS),5
data warehouse,0,rameshwaran@gmail.com,Data warehouse,1
GCP engineer,5,krishnamurthy@gmail.com,GCP engineer,0
PySpark engineer,9,vishweswarrao@gmail.com,PySpark engineer,0


#### **10) Using locate() with selectExpr**

In [0]:
df_new = df.selectExpr(
    "text",
    "locate('data', text) as data_position"
)

display(df_new)

text,data_position
Azure data engineer (ADE),7
AWS data engineer (AWS),5
data warehouse,1
GCP engineer,0
PySpark engineer,0


#### **11) Using locate() with expr**

In [0]:
# Use expr to create a column with locate expression
df_expr = df.withColumn("position_expr", expr("locate('data', text)"))
display(df_expr)

text,mail,position_expr
Azure data engineer (ADE),suman@gmail.com,7
AWS data engineer (AWS),kiranrathod@gmail.com,5
data warehouse,rameshwaran@gmail.com,1
GCP engineer,krishnamurthy@gmail.com,0
PySpark engineer,vishweswarrao@gmail.com,0
