
In [11]:
!pip install pyspark



In [12]:
import pyspark
from pyspark.sql import SparkSession
  
spark = SparkSession.builder.appName("PySpark like, ilike, rlike and not like").getOrCreate()
spark

In [13]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Sample employee rows; firstname is the column filtered in the examples below
data1 = [
    ("James", "", "Smith", "36636", "M", 1000, "Sales", 2020),
    ("Michael", "Rose", "", "40288", "M", 2000, "Operations", 2020),
    ("Robert", "", "Williams", "42114", "M", 3000, "Sales", 2020),
    ("Maria", "Anne", "Jones", "39192", "F", 4000, "Operations", 2020),
    ("Ria", "Anne", "Jones", "60000", "F", 7000, "Operations", 2020),
]

schema1 = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("annualsalary", IntegerType(), True),
    StructField("work", StringType(), True),
    StructField("year", IntegerType(), True),
])

df1 = spark.createDataFrame(data=data1,schema=schema1)
df1.printSchema()
df1.show(truncate=False)

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- annualsalary: integer (nullable = true)
 |-- work: string (nullable = true)
 |-- year: integer (nullable = true)

+---------+----------+--------+-----+------+------------+----------+----+
|firstname|middlename|lastname|id   |gender|annualsalary|work      |year|
+---------+----------+--------+-----+------+------------+----------+----+
|James    |          |Smith   |36636|M     |1000        |Sales     |2020|
|Michael  |Rose      |        |40288|M     |2000        |Operations|2020|
|Robert   |          |Williams|42114|M     |3000        |Sales     |2020|
|Maria    |Anne      |Jones   |39192|F     |4000        |Operations|2020|
|Ria      |Anne      |Jones   |60000|F     |7000        |Operations|2020|
+---------+----------+--------+-----+------+------------+----------+----+



#ILIKE 

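`ilike` is the SQL ILIKE expression (a case insensitive LIKE), available on a column from PySpark 3.3.0.
It returns a boolean Column based on a case ***insensitive*** match, so the pattern `'%Ria'` keeps every first name that ends in `ria`, whatever its case.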


In [14]:
df1.filter(df1.firstname.ilike('%Ria')).collect()


[Row(firstname='Maria', middlename='Anne', lastname='Jones', id='39192', gender='F', annualsalary=4000, work='Operations', year=2020),
 Row(firstname='Ria', middlename='Anne', lastname='Jones', id='60000', gender='F', annualsalary=7000, work='Operations', year=2020)]

#RLIKE

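We can get the same result with **RLIKE**, which takes a regular expression instead of a LIKE pattern. Here `(?i)` makes the match case insensitive and `$` anchors it to the end of the value.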

In [25]:

df1.filter(df1.firstname.rlike('(?i)Ria$')).collect()

[Row(firstname='Maria', middlename='Anne', lastname='Jones', id='39192', gender='F', annualsalary=4000, work='Operations', year=2020),
 Row(firstname='Ria', middlename='Anne', lastname='Jones', id='60000', gender='F', annualsalary=7000, work='Operations', year=2020)]
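
To make the two pieces of the regular expression easier to see, here is a minimal sketch in plain Python, with no Spark involved. It assumes that Python's `re.search` behaves like `rlike` for this simple pattern (both look for the pattern anywhere in the value unless it is anchored); the list `firstnames` is just the first-name column copied by hand.

```python
import re

# First names copied from df1 (hand-written here, not read from Spark)
firstnames = ["James", "Michael", "Robert", "Maria", "Ria"]

# (?i) switches on case insensitive matching; $ anchors the match to the end.
# re.search looks for the pattern anywhere in the string, as rlike does.
rlike_matches = [n for n in firstnames if re.search(r"(?i)Ria$", n)]

# Dropping (?i) makes the match case sensitive again.
case_sensitive_matches = [n for n in firstnames if re.search(r"Ria$", n)]

# Anchoring both ends requires the whole value to be "ria" in any case.
fully_anchored_matches = [n for n in firstnames if re.search(r"(?i)^Ria$", n)]

print(rlike_matches)           # ['Maria', 'Ria']
print(case_sensitive_matches)  # ['Ria']
print(fully_anchored_matches)  # ['Ria']
```

Spark evaluates `rlike` with Java regular expressions rather than Python's, so more advanced syntax may differ between the two; the flag and anchors used here are common to both.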

#LIKE

**LIKE** is a case sensitive match, so `'%Ria'` keeps only `Ria` and drops `Maria`.

In [16]:
df1.filter(df1.firstname.like('%Ria')).collect()

[Row(firstname='Ria', middlename='Anne', lastname='Jones', id='60000', gender='F', annualsalary=7000, work='Operations', year=2020)]
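
The difference between the `like` result here and the earlier `ilike` result comes down to how the wildcards are read: `%` stands for zero or more characters, `_` for exactly one, and the pattern has to cover the whole value. The following is a rough plain-Python sketch of those rules, not Spark's implementation; the helper names `like_to_regex` and `sql_like` are made up for illustration, and escape characters are not handled.

```python
import re

def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into a regex (hypothetical helper, no escape handling)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches zero or more characters
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    return "".join(parts)

def sql_like(value: str, pattern: str, ignore_case: bool = False) -> bool:
    """Return True when the whole value matches the LIKE pattern."""
    flags = re.DOTALL | (re.IGNORECASE if ignore_case else 0)
    return re.fullmatch(like_to_regex(pattern), value, flags) is not None

# First names copied from df1 (hand-written here, not read from Spark)
firstnames = ["James", "Michael", "Robert", "Maria", "Ria"]

like_matches = [n for n in firstnames if sql_like(n, "%Ria")]
ilike_matches = [n for n in firstnames if sql_like(n, "%Ria", ignore_case=True)]

print(like_matches)   # ['Ria']
print(ilike_matches)  # ['Maria', 'Ria']
```

`Maria` ends in lowercase `ria`, so it only passes once case is ignored, which is the behaviour seen in the two Spark results.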

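##LIKE as a SQL expression string

The same filter can be passed to `filter` as a SQL expression string. Note that the pattern used here, `'%Ria%'`, matches `Ria` anywhere in the value rather than only at the end; on this data the result is the same.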

In [19]:
df = df1.filter("firstname like '%Ria%'")
df.collect()

[Row(firstname='Ria', middlename='Anne', lastname='Jones', id='60000', gender='F', annualsalary=7000, work='Operations', year=2020)]

#Not Like 

There is no `notlike` function on a column. Instead, negate the result of **LIKE** with the **'~'** operator.


In [17]:
df1.filter(~ df1.firstname.like('%Ria')).collect()

[Row(firstname='James', middlename='', lastname='Smith', id='36636', gender='M', annualsalary=1000, work='Sales', year=2020),
 Row(firstname='Michael', middlename='Rose', lastname='', id='40288', gender='M', annualsalary=2000, work='Operations', year=2020),
 Row(firstname='Robert', middlename='', lastname='Williams', id='42114', gender='M', annualsalary=3000, work='Sales', year=2020),
 Row(firstname='Maria', middlename='Anne', lastname='Jones', id='39192', gender='F', annualsalary=4000, work='Operations', year=2020)]

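##In a SQL expression we can use **not like**

The pattern here is `'%Ria%'` (contains `Ria`) rather than `'%Ria'` (ends with `Ria`); on this data both exclude only `Ria`.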

In [18]:
df = df1.filter("firstname not like '%Ria%'")
df.collect()

[Row(firstname='James', middlename='', lastname='Smith', id='36636', gender='M', annualsalary=1000, work='Sales', year=2020),
 Row(firstname='Michael', middlename='Rose', lastname='', id='40288', gender='M', annualsalary=2000, work='Operations', year=2020),
 Row(firstname='Robert', middlename='', lastname='Williams', id='42114', gender='M', annualsalary=3000, work='Sales', year=2020),
 Row(firstname='Maria', middlename='Anne', lastname='Jones', id='39192', gender='F', annualsalary=4000, work='Operations', year=2020)]