# Filter Operations

The filter operation selects the rows of a DataFrame that satisfy a given condition, which lets you narrow a dataset down to the records that meet specific criteria.

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('Filter_Operation').getOrCreate()

In [3]:
df_pyspark = spark.read.csv('data/names_and_ages.csv', header=True, inferSchema=True, sep=';')
df_pyspark.show(5)

+-------+---+----------+----------+------+--------------------+
|   Name|Age|Experience|Salary_USD|ID job|    Current Position|
+-------+---+----------+----------+------+--------------------+
|  Alice| 25|         2|      2911|     9|    Graphic Designer|
|    Bob| 30|         4|      3443|     2|      Data Scientist|
|Charlie| 22|         7|      7034|     5|   Marketing Manager|
|  David| 35|        12|      9118|     6|   Financial Analyst|
|   Emma| 28|         9|     12455|     7|Human Resources S...|
+-------+---+----------+----------+------+--------------------+
only showing top 5 rows



From the `names_and_ages.csv` file, we want to select only the rows where `Salary_USD` is less than or equal to 8500.

In [5]:
# Filtering
df_pyspark.filter('Salary_USD<=8500').show()

+--------+---+----------+----------+------+--------------------+
|    Name|Age|Experience|Salary_USD|ID job|    Current Position|
+--------+---+----------+----------+------+--------------------+
|   Alice| 25|         2|      2911|     9|    Graphic Designer|
|     Bob| 30|         4|      3443|     2|      Data Scientist|
| Charlie| 22|         7|      7034|     5|   Marketing Manager|
|   Frank| 40|         1|      8372|     1|   Software Engineer|
|   Grace| 23|         3|      2443|     7|Human Resources S...|
|   Henry| 32|        14|      7750|    10|  Operations Manager|
|   Irene| 27|        25|      3635|     7|Human Resources S...|
|   Karen| 26|         3|      1940|     7|Human Resources S...|
|     Leo| 29|         1|      4865|     8|Customer Service ...|
|   Maria| 31|         0|      1883|     8|Customer Service ...|
|  Nathan| 37|         0|      7096|     3|     Project Manager|
|  Olivia| 24|         3|      6736|     9|    Graphic Designer|
|    Paul| 38|         1|      2120|     3|     Project Manager|
|  Rachel| 34|         3|

To keep only the `Name` and `Age` columns of the rows where `Salary_USD` is less than or equal to 8500, we chain `select` after `filter`:

In [6]:
df_pyspark.filter('Salary_USD<=8500').select(['Name', 'Age']).show()

+--------+---+
|    Name|Age|
+--------+---+
|   Alice| 25|
|     Bob| 30|
| Charlie| 22|
|   Frank| 40|
|   Grace| 23|
|   Henry| 32|
|   Irene| 27|
|   Karen| 26|
|     Leo| 29|
|   Maria| 31|
|  Nathan| 37|
|  Olivia| 24|
|    Paul| 38|
|  Rachel| 34|
|     Sam| 39|
|  Taylor| 36|
| Ulysses| 42|
|Victoria| 20|
|  Walter| 45|
| Zachary| 29|
+--------+---+
only showing top 20 rows



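The same filter can also be written as a column expression instead of a SQL string. The command below returns the same rows as `filter('Salary_USD<=8500')`, with all columns: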

In [8]:
df_pyspark.filter(df_pyspark['Salary_USD']<=8500).show()

+--------+---+----------+----------+------+--------------------+
|    Name|Age|Experience|Salary_USD|ID job|    Current Position|
+--------+---+----------+----------+------+--------------------+
|   Alice| 25|         2|      2911|     9|    Graphic Designer|
|     Bob| 30|         4|      3443|     2|      Data Scientist|
| Charlie| 22|         7|      7034|     5|   Marketing Manager|
|   Frank| 40|         1|      8372|     1|   Software Engineer|
|   Grace| 23|         3|      2443|     7|Human Resources S...|
|   Henry| 32|        14|      7750|    10|  Operations Manager|
|   Irene| 27|        25|      3635|     7|Human Resources S...|
|   Karen| 26|         3|      1940|     7|Human Resources S...|
|     Leo| 29|         1|      4865|     8|Customer Service ...|
|   Maria| 31|         0|      1883|     8|Customer Service ...|
|  Nathan| 37|         0|      7096|     3|     Project Manager|
|  Olivia| 24|         3|      6736|     9|    Graphic Designer|
|    Paul| 38|         1|

To filter on two conditions that must both hold, we combine them with the `&` operator, wrapping each condition in parentheses. Here we select the rows where `Salary_USD` is greater than or equal to 8500 and less than or equal to 12000.

In [11]:
df_pyspark.filter((df_pyspark['Salary_USD']<=12000) &
                   (df_pyspark['Salary_USD']>=8500)).show()

+------+---+----------+----------+------+--------------------+
|  Name|Age|Experience|Salary_USD|ID job|    Current Position|
+------+---+----------+----------+------+--------------------+
| David| 35|        12|      9118|     6|   Financial Analyst|
|Daniel| 35|        21|     11092|     2|      Data Scientist|
|  Finn| 30|         9|     11485|     5|   Marketing Manager|
| Jacob| 28|         2|     10900|     4|     Sales Associate|
| Riley| 29|        19|     11555|     7|Human Resources S...|
|Sophia| 41|        18|      8562|     5|   Marketing Manager|
|Ursula| 26|         2|     10230|    10|  Operations Manager|
+------+---+----------+----------+------+--------------------+



Conditions can also be combined with the or operator `|`. Here we select the rows where `Salary_USD` is less than or equal to 2000 or greater than or equal to 15000.

In [20]:
df_pyspark.filter((df_pyspark['Salary_USD']<=2000) |
                   (df_pyspark['Salary_USD']>=15000)).show(40)

+--------+---+----------+----------+------+--------------------+
|    Name|Age|Experience|Salary_USD|ID job|    Current Position|
+--------+---+----------+----------+------+--------------------+
|    Jack| 33|         2|     15356|     9|    Graphic Designer|
|   Karen| 26|         3|      1940|     7|Human Resources S...|
|   Maria| 31|         0|      1883|     8|Customer Service ...|
|   Quinn| 21|        14|     16975|     3|     Project Manager|
|  Xander| 28|         2|     19054|     9|    Graphic Designer|
|Isabella| 36|         7|      1965|     4|     Sales Associate|
|   Kylie| 25|         4|      1602|     1|   Software Engineer|
|   Megan| 22|        12|     17440|     7|Human Resources S...|
|  Willow| 33|        17|     16315|     3|     Project Manager|
+--------+---+----------+----------+------+--------------------+
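
The parentheses around each comparison in the `&` and `|` examples are not optional. In Python, `&` and `|` bind more tightly than `<=` and `>=`, so an unparenthesized condition is grouped differently from how it reads. The sketch below shows the grouping with plain integers (the value 9000 is an arbitrary illustration); with PySpark columns the same grouping typically ends in an error rather than a silently wrong answer.

```python
salary = 9000  # arbitrary example value inside the 8500..12000 range

# Intended meaning: both comparisons hold.
parenthesized = (salary <= 12000) & (salary >= 8500)

# Without parentheses Python evaluates `12000 & salary` first and then
# chains the comparisons: salary <= (12000 & salary) >= 8500.
unparenthesized = salary <= 12000 & salary >= 8500
explicit_grouping = salary <= (12000 & salary) >= 8500

print(parenthesized)    # True
print(unparenthesized)  # False
```

The same rule applies to `|`, so each side of a combined condition should be wrapped in its own parentheses.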



Negation is written with the `~` operator. To select the rows where `Salary_USD` is not less than or equal to 8000, that is, greater than 8000, we use:

In [21]:
df_pyspark.filter(~(df_pyspark['Salary_USD']<=8000)).show()

+--------+---+----------+----------+------+--------------------+
|    Name|Age|Experience|Salary_USD|ID job|    Current Position|
+--------+---+----------+----------+------+--------------------+
|   David| 35|        12|      9118|     6|   Financial Analyst|
|    Emma| 28|         9|     12455|     7|Human Resources S...|
|   Frank| 40|         1|      8372|     1|   Software Engineer|
|    Jack| 33|         2|     15356|     9|    Graphic Designer|
|   Quinn| 21|        14|     16975|     3|     Project Manager|
|Victoria| 20|         6|      8174|    10|  Operations Manager|
|  Xander| 28|         2|     19054|     9|    Graphic Designer|
| Yasmine| 31|         8|     12362|    10|  Operations Manager|
|   Chloe| 23|         6|      8412|     6|   Financial Analyst|
|  Daniel| 35|        21|     11092|     2|      Data Scientist|
|    Finn| 30|         9|     11485|     5|   Marketing Manager|
| Giselle| 24|         8|     14278|     9|    Graphic Designer|
|  Hayden| 33|         5|