#### **PROBLEM STATEMENT**

**How do you extract the portion of text after a delimiter?**
 
 1) Read this **comma-delimited CSV file** and create a dataframe
 2) Extract column2 **(the portion after the pipe)** along with column3
 3) Rename the 2nd part of column2 to **ErrorCode** and column3 to **Count**
 
         Ex Output:
         ErrorCode, Count
         b3344002000,1.0
 4) Convert the **Count** column's data type to **int**
 5) Perform **Distinct** on the dataframe created in step 4
 6) Add a new column named **ExecutionDate** holding **today's date**.
 7) Add a new column named **Environment** holding the constant value **"Staging"**.

#### **1) Read this (, delimited) csv file, create a dataframe**

In [0]:
df = spark.read.csv("/FileStore/tables/split.txt", header=True)
df.show()
df.printSchema()

+-------------+--------------------+-------+
|      column1|             column2|column3|
+-------------+--------------------+-------+
|         2234|    ec-lookup | e030|    1.0|
|    224566634|ec-lookup | 00000456|    1.0|
|   8899992234| ec-lookup | 0x99999|    5.0|
|    678882234|  ec-lookup | 002000|    1.0|
|   8899992234| ec-lookup | 0x99999|    5.0|
|    678882234|  ec-lookup | 002000|    1.0|
|   0099992234| ec-lookup | 0x99999|    5.0|
|  99678882234|ec-lookup | sx-LB000|    1.0|
| 998899992234|ec-lookup | 0xbx9...|    5.0|
|    878882234|ec-lookup | b3344...|    1.0|
+-------------+--------------------+-------+

root
 |--  column1: string (nullable = true)
 |-- column2: string (nullable = true)
 |-- column3: string (nullable = true)



#### **2) Extract 'Column2' (portion which is after the pipe) along with 'column3'**

**Method 01: split()**

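- The split() function splits a string column into an array using a regex pattern; index 1 holds the text after the first pipe.

      from pyspark.sql.functions import split

      # Extract text after the delimiter "|"
      df_with_extracted = df.withColumn("Extracted", split(df["column2"], r"\|").getItem(1))
                                           (or)
      # Same result with bracket indexing instead of getItem()
      df = df.withColumn("New_Column", split(df["column2"], r"\|")[1])
                                          (or)
      # Without a raw string, the backslash has to be doubled
      df = df.withColumn("New_Column", split(df["column2"], "\\|")[1])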

**Method 02: regexp_extract()**

- The regexp_extract() function extracts substrings based on a regular expression.

      from pyspark.sql.functions import regexp_extract

      # Extract text after the delimiter using regex
      df_with_extracted = df.withColumn("Extracted", regexp_extract(df["column2"], r"\| (.+)", 1))

**Method 03: substring_index()**

- The substring_index() function retrieves portions of a string relative to a delimiter.
     
      from pyspark.sql.functions import substring_index

      # Extract text after the last occurrence of the delimiter (count = -1)
      df_with_extracted = df.withColumn("Extracted", substring_index(df["column2"], "|", -1))

**Method 04: expr()**
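- The expr() function evaluates a SQL expression string, so the same split() logic can be written in Spark SQL syntax.

      from pyspark.sql.functions import expr
      
      # Extract text after the delimiter. The character class [|] matches a literal pipe
      # without backslashes, which the SQL parser would otherwise unescape inside the string literal.
      df_with_extracted = df.withColumn("Extracted", expr("split(column2, '[|]')[1]"))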

In [0]:
from pyspark.sql.functions import split, size, col
from pyspark.sql.types import IntegerType

In [0]:
# Split function splits a string column into an array based on the specified delimiter
df_split = df.withColumn("New_Column", split(df["column2"], r"\|"))
df_split.show(truncate=False)
df_split.printSchema()

+-------------+-----------------------+-------+--------------------------+
| column1     |column2                |column3|New_Column                |
+-------------+-----------------------+-------+--------------------------+
| 2234        |ec-lookup | e030       |1.0    |[ec-lookup ,  e030]       |
| 224566634   |ec-lookup | 00000456   |1.0    |[ec-lookup ,  00000456]   |
| 8899992234  |ec-lookup | 0x99999    |5.0    |[ec-lookup ,  0x99999]    |
| 678882234   |ec-lookup | 002000     |1.0    |[ec-lookup ,  002000]     |
| 8899992234  |ec-lookup | 0x99999    |5.0    |[ec-lookup ,  0x99999]    |
| 678882234   |ec-lookup | 002000     |1.0    |[ec-lookup ,  002000]     |
| 0099992234  |ec-lookup | 0x99999    |5.0    |[ec-lookup ,  0x99999]    |
| 99678882234 |ec-lookup | sx-LB000   |1.0    |[ec-lookup ,  sx-LB000]   |
| 998899992234|ec-lookup | 0xbx99999  |5.0    |[ec-lookup ,  0xbx99999]  |
| 878882234   |ec-lookup | b3344002000|1.0    |[ec-lookup ,  b3344002000]|
+-------------+-----------------------+-------+--------------------------+

In [0]:
df = df.withColumn("New_Column", split(df["column2"], r"\|")[1])
df.show(truncate=False)

+-------------+-----------------------+-------+------------+
| column1     |column2                |column3|New_Column  |
+-------------+-----------------------+-------+------------+
| 2234        |ec-lookup | e030       |1.0    | e030       |
| 224566634   |ec-lookup | 00000456   |1.0    | 00000456   |
| 8899992234  |ec-lookup | 0x99999    |5.0    | 0x99999    |
| 678882234   |ec-lookup | 002000     |1.0    | 002000     |
| 8899992234  |ec-lookup | 0x99999    |5.0    | 0x99999    |
| 678882234   |ec-lookup | 002000     |1.0    | 002000     |
| 0099992234  |ec-lookup | 0x99999    |5.0    | 0x99999    |
| 99678882234 |ec-lookup | sx-LB000   |1.0    | sx-LB000   |
| 998899992234|ec-lookup | 0xbx99999  |5.0    | 0xbx99999  |
| 878882234   |ec-lookup | b3344002000|1.0    | b3344002000|
+-------------+-----------------------+-------+------------+



**split():**
- Splits **column2 into an array** using the **|** delimiter.

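**r"\|":**
- Escapes the | character, which otherwise means alternation in a regex. The raw-string prefix r keeps the backslash intact in Python.

**split(df["column2"], r"\|")[1]:**
- Selects the second array element (index 1), that is, the text after the delimiter, including the space that follows the pipe.

**withColumn():**
- Creates a new column (New_Column in the cells above) with the extracted portion.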
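
The effect of escaping the pipe can be illustrated outside Spark with Python's own `re` module. This is only an analogy: Spark evaluates the pattern with Java's regex engine, and the snippet assumes Python 3.7 or later, where `re.split` accepts patterns that match the empty string.

```python
import re

value = "ec-lookup | e030"

# Escaped pipe: matches the literal "|" character.
escaped_parts = re.split(r"\|", value)

# A character class is an equivalent way to match a literal pipe.
class_parts = re.split(r"[|]", value)

# Unescaped pipe: an alternation between two empty patterns, so the "|"
# character is never treated as the delimiter.
unescaped_parts = re.split("|", value)

print(escaped_parts)    # two parts: the text before and after the pipe
print(unescaped_parts)  # not the two-part split
```

The second element of `escaped_parts` still starts with a space, which is why a trimming step is often applied after the split.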

#### **3) Rename 2nd part of 'column2' to 'ErrorCode' and 'column3' to 'Count'**
**Ex: Output**
- ErrorCode, Count as b3344002000, 1.0

In [0]:
df = df.withColumnRenamed("New_Column", "ErrorCode")\
       .withColumnRenamed("column3", "Count")
df.show(truncate=False)
df.printSchema()

+-------------+-----------------------+-----+------------+
| column1     |column2                |Count|ErrorCode   |
+-------------+-----------------------+-----+------------+
| 2234        |ec-lookup | e030       |1.0  | e030       |
| 224566634   |ec-lookup | 00000456   |1.0  | 00000456   |
| 8899992234  |ec-lookup | 0x99999    |5.0  | 0x99999    |
| 678882234   |ec-lookup | 002000     |1.0  | 002000     |
| 8899992234  |ec-lookup | 0x99999    |5.0  | 0x99999    |
| 678882234   |ec-lookup | 002000     |1.0  | 002000     |
| 0099992234  |ec-lookup | 0x99999    |5.0  | 0x99999    |
| 99678882234 |ec-lookup | sx-LB000   |1.0  | sx-LB000   |
| 998899992234|ec-lookup | 0xbx99999  |5.0  | 0xbx99999  |
| 878882234   |ec-lookup | b3344002000|1.0  | b3344002000|
+-------------+-----------------------+-----+------------+

root
 |--  column1: string (nullable = true)
 |-- column2: string (nullable = true)
 |-- Count: string (nullable = true)
 |-- ErrorCode: string (nullable = true)



#### **4) Convert Column 'Count' data type to 'int'**

In [0]:
df = df.withColumn("Count", df["Count"].cast(IntegerType()))
df.show(truncate=False)
df.printSchema()
print("Number of Rows:", df.count())

+-------------+-----------------------+-----+------------+
| column1     |column2                |Count|ErrorCode   |
+-------------+-----------------------+-----+------------+
| 2234        |ec-lookup | e030       |1    | e030       |
| 224566634   |ec-lookup | 00000456   |1    | 00000456   |
| 8899992234  |ec-lookup | 0x99999    |5    | 0x99999    |
| 678882234   |ec-lookup | 002000     |1    | 002000     |
| 8899992234  |ec-lookup | 0x99999    |5    | 0x99999    |
| 678882234   |ec-lookup | 002000     |1    | 002000     |
| 0099992234  |ec-lookup | 0x99999    |5    | 0x99999    |
| 99678882234 |ec-lookup | sx-LB000   |1    | sx-LB000   |
| 998899992234|ec-lookup | 0xbx99999  |5    | 0xbx99999  |
| 878882234   |ec-lookup | b3344002000|1    | b3344002000|
+-------------+-----------------------+-----+------------+

root
 |--  column1: string (nullable = true)
 |-- column2: string (nullable = true)
 |-- Count: integer (nullable = true)
 |-- ErrorCode: string (nullable = true)

Number of Rows: 10

#### **5) Perform Distinct on dataframe created in step 4**

In [0]:
df_distinct = df.distinct()
df_distinct.show()
print("Number of Rows", df_distinct.count())

+-------------+--------------------+-----+------------+
|      column1|             column2|Count|   ErrorCode|
+-------------+--------------------+-----+------------+
|    224566634|ec-lookup | 00000456|    1|    00000456|
|    678882234|  ec-lookup | 002000|    1|      002000|
|    878882234|ec-lookup | b3344...|    1| b3344002000|
|         2234|    ec-lookup | e030|    1|        e030|
|  99678882234|ec-lookup | sx-LB000|    1|    sx-LB000|
|   8899992234| ec-lookup | 0x99999|    5|     0x99999|
|   0099992234| ec-lookup | 0x99999|    5|     0x99999|
| 998899992234|ec-lookup | 0xbx9...|    5|   0xbx99999|
+-------------+--------------------+-----+------------+

Number of Rows 8


#### **6) Add new column by name 'ExecutionDate' having today's date**

In [0]:
from pyspark.sql.functions import current_date
df1 = df_distinct.withColumn("ExecutionDate", current_date())
df1.show()

+-------------+--------------------+-----+------------+-------------+
|      column1|             column2|Count|   ErrorCode|ExecutionDate|
+-------------+--------------------+-----+------------+-------------+
|    224566634|ec-lookup | 00000456|    1|    00000456|   2024-07-22|
|    678882234|  ec-lookup | 002000|    1|      002000|   2024-07-22|
|    878882234|ec-lookup | b3344...|    1| b3344002000|   2024-07-22|
|         2234|    ec-lookup | e030|    1|        e030|   2024-07-22|
|  99678882234|ec-lookup | sx-LB000|    1|    sx-LB000|   2024-07-22|
|   8899992234| ec-lookup | 0x99999|    5|     0x99999|   2024-07-22|
|   0099992234| ec-lookup | 0x99999|    5|     0x99999|   2024-07-22|
| 998899992234|ec-lookup | 0xbx9...|    5|   0xbx99999|   2024-07-22|
+-------------+--------------------+-----+------------+-------------+



#### **7) Add new column by name 'Environment' having constant value "Staging"**

In [0]:
from pyspark.sql.functions import lit
df1 = df1.withColumn("Environment", lit("Staging"))
df1.show()

+-------------+--------------------+-----+------------+-------------+-----------+
|      column1|             column2|Count|   ErrorCode|ExecutionDate|Environment|
+-------------+--------------------+-----+------------+-------------+-----------+
|    224566634|ec-lookup | 00000456|    1|    00000456|   2024-07-22|    Staging|
|    678882234|  ec-lookup | 002000|    1|      002000|   2024-07-22|    Staging|
|    878882234|ec-lookup | b3344...|    1| b3344002000|   2024-07-22|    Staging|
|         2234|    ec-lookup | e030|    1|        e030|   2024-07-22|    Staging|
|  99678882234|ec-lookup | sx-LB000|    1|    sx-LB000|   2024-07-22|    Staging|
|   8899992234| ec-lookup | 0x99999|    5|     0x99999|   2024-07-22|    Staging|
|   0099992234| ec-lookup | 0x99999|    5|     0x99999|   2024-07-22|    Staging|
| 998899992234|ec-lookup | 0xbx9...|    5|   0xbx99999|   2024-07-22|    Staging|
+-------------+--------------------+-----+------------+-------------+-----------+

