# PySpark - Search and Filter Techniques for DataFrames
* Notebook by Adam Lang
* Date: 1/3/2024

# Overview
* In this notebook we will go over techniques and best practices for searching and filtering large datasets in DataFrames using PySpark.
* These include:

1. PySpark SQL functions
2. Select method
3. Order By
4. Like operator (searching strings)
5. Substring searches
6. Is In operator
7. Starts with, Ends with
8. Slicing
9. Filtering
10. Collecting results as objects

# Create Spark Session

In [1]:
## spark session
import pyspark
from pyspark.sql import SparkSession

## init spark session
spark = SparkSession.builder.appName("SearchAndFilter").getOrCreate()

In [2]:
## view spark session
spark

# Read in Dataframe
* This is a dataset of FIFA 19 players.

In [4]:
## set path
path = "/content/drive/MyDrive/Colab Notebooks/PySpark Data Science/"

# load df
fifa = spark.read.csv(path+'fifa19.csv',inferSchema=True,header=True)


In [5]:
## view head of spark df
fifa.limit(5).toPandas()

|   | _c0 | ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | ... | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 158023 | L. Messi | 31 | https://cdn.sofifa.org/players/4/19/158023.png | Argentina | https://cdn.sofifa.org/flags/52.png | 94 | 94 | FC Barcelona | ... | 96 | 33 | 28 | 26 | 6 | 11 | 15 | 14 | 8 | €226.5M |
| 1 | 1 | 20801 | Cristiano Ronaldo | 33 | https://cdn.sofifa.org/players/4/19/20801.png | Portugal | https://cdn.sofifa.org/flags/38.png | 94 | 94 | Juventus | ... | 95 | 28 | 31 | 23 | 7 | 11 | 15 | 14 | 11 | €127.1M |
| 2 | 2 | 190871 | Neymar Jr | 26 | https://cdn.sofifa.org/players/4/19/190871.png | Brazil | https://cdn.sofifa.org/flags/54.png | 92 | 93 | Paris Saint-Germain | ... | 94 | 27 | 24 | 33 | 9 | 9 | 15 | 15 | 11 | €228.1M |
| 3 | 3 | 193080 | De Gea | 27 | https://cdn.sofifa.org/players/4/19/193080.png | Spain | https://cdn.sofifa.org/flags/45.png | 91 | 93 | Manchester United | ... | 68 | 15 | 21 | 13 | 90 | 85 | 87 | 88 | 94 | €138.6M |
| 4 | 4 | 192985 | K. De Bruyne | 27 | https://cdn.sofifa.org/players/4/19/192985.png | Belgium | https://cdn.sofifa.org/flags/7.png | 91 | 92 | Manchester City | ... | 88 | 68 | 58 | 51 | 15 | 13 | 5 | 10 | 13 | €196.4M |


In [6]:
## let's see the columns and their inferred types
fifa.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Photo: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Flag: string (nullable = true)
 |-- Overall: integer (nullable = true)
 |-- Potential: integer (nullable = true)
 |-- Club: string (nullable = true)
 |-- Club Logo: string (nullable = true)
 |-- Value: string (nullable = true)
 |-- Wage: string (nullable = true)
 |-- Special: integer (nullable = true)
 |-- Preferred Foot: string (nullable = true)
 |-- International Reputation: integer (nullable = true)
 |-- Weak Foot: integer (nullable = true)
 |-- Skill Moves: integer (nullable = true)
 |-- Work Rate: string (nullable = true)
 |-- Body Type: string (nullable = true)
 |-- Real Face: string (nullable = true)
 |-- Position: string (nullable = true)
 |-- Jersey Number: integer (nullable = true)
 |-- Joined: string (nullable = true)
 |-- Loaned From: string (nu

## SQL Functions in PySpark

### Select

In [7]:
## load sql functions
from pyspark.sql.functions import *

In [9]:
## select a few variables -- adding 'False' prevents string truncation in .show()
fifa.select(['Nationality','Name','Age','Photo']).show(5, False)

+-----------+-----------------+---+----------------------------------------------+
|Nationality|Name             |Age|Photo                                         |
+-----------+-----------------+---+----------------------------------------------+
|Argentina  |L. Messi         |31 |https://cdn.sofifa.org/players/4/19/158023.png|
|Portugal   |Cristiano Ronaldo|33 |https://cdn.sofifa.org/players/4/19/20801.png |
|Brazil     |Neymar Jr        |26 |https://cdn.sofifa.org/players/4/19/190871.png|
|Spain      |De Gea           |27 |https://cdn.sofifa.org/players/4/19/193080.png|
|Belgium    |K. De Bruyne     |27 |https://cdn.sofifa.org/players/4/19/192985.png|
+-----------+-----------------+---+----------------------------------------------+
only showing top 5 rows



### OrderBy

In [10]:
## order by function
fifa.select(['Name','Age']).orderBy(fifa['Age']).show(5,False)

+-----------+---+
|Name       |Age|
+-----------+---+
|Y. Roemer  |16 |
|W. Geubbels|16 |
|J. Kitolano|16 |
|A. Taoui   |16 |
|Y. Begraoui|16 |
+-----------+---+
only showing top 5 rows



In [11]:
## let's find the oldest players (sort descending)
fifa.select(['Name','Age']).orderBy(fifa['Age'].desc()).show(5)

+-------------+---+
|         Name|Age|
+-------------+---+
|     O. Pérez| 45|
|    T. Warner| 44|
|K. Pilkington| 44|
|  S. Narazaki| 42|
|     M. Tyler| 41|
+-------------+---+
only showing top 5 rows



In [12]:
## youngest players? -- ascending is the default, so this matches the plain orderBy above
fifa.select(['Name','Age']).orderBy(fifa['Age'].asc()).show(5)

+-----------+---+
|       Name|Age|
+-----------+---+
|  Y. Roemer| 16|
|W. Geubbels| 16|
|J. Kitolano| 16|
|   A. Taoui| 16|
|Y. Begraoui| 16|
+-----------+---+
only showing top 5 rows



### Like operator

In [13]:
## all players whose club name contains 'Barcelona' (% matches any sequence of characters)
fifa.select(['Name','Club']).where(fifa.Club.like("%Barcelona%")).show(5,False)

+---------------+------------+
|Name           |Club        |
+---------------+------------+
|L. Messi       |FC Barcelona|
|L. Suárez      |FC Barcelona|
|M. ter Stegen  |FC Barcelona|
|Sergio Busquets|FC Barcelona|
|Coutinho       |FC Barcelona|
+---------------+------------+
only showing top 5 rows



### Substring Filters


In [15]:
## return part of a string --> Photo column
## substr(start, length): a negative start counts back from the end, so (-4, 4) returns the last 4 characters
fifa.select("Photo",fifa.Photo.substr(-4,4)).show(5, False)

+----------------------------------------------+-----------------------+
|Photo                                         |substring(Photo, -4, 4)|
+----------------------------------------------+-----------------------+
|https://cdn.sofifa.org/players/4/19/158023.png|.png                   |
|https://cdn.sofifa.org/players/4/19/20801.png |.png                   |
|https://cdn.sofifa.org/players/4/19/190871.png|.png                   |
|https://cdn.sofifa.org/players/4/19/193080.png|.png                   |
|https://cdn.sofifa.org/players/4/19/192985.png|.png                   |
+----------------------------------------------+-----------------------+
only showing top 5 rows



### Isin filters

In [18]:
## isin filtering
fifa[fifa.Club.isin("FC Barcelona", "Juventus")].limit(4).toPandas()

|   | _c0 | ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | ... | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 158023 | L. Messi | 31 | https://cdn.sofifa.org/players/4/19/158023.png | Argentina | https://cdn.sofifa.org/flags/52.png | 94 | 94 | FC Barcelona | ... | 96 | 33 | 28 | 26 | 6 | 11 | 15 | 14 | 8 | €226.5M |
| 1 | 1 | 20801 | Cristiano Ronaldo | 33 | https://cdn.sofifa.org/players/4/19/20801.png | Portugal | https://cdn.sofifa.org/flags/38.png | 94 | 94 | Juventus | ... | 95 | 28 | 31 | 23 | 7 | 11 | 15 | 14 | 11 | €127.1M |
| 2 | 7 | 176580 | L. Suárez | 31 | https://cdn.sofifa.org/players/4/19/176580.png | Uruguay | https://cdn.sofifa.org/flags/60.png | 91 | 91 | FC Barcelona | ... | 85 | 62 | 45 | 38 | 27 | 25 | 31 | 33 | 37 | €164M |
| 3 | 15 | 211110 | P. Dybala | 24 | https://cdn.sofifa.org/players/4/19/211110.png | Argentina | https://cdn.sofifa.org/flags/52.png | 89 | 94 | Juventus | ... | 84 | 23 | 20 | 20 | 5 | 4 | 4 | 5 | 8 | €153.5M |


### Starts with, Ends with operators

In [19]:
## let's find names that start with "L" and end with "i" (chained where() calls are ANDed together)
fifa.select("Name","Club").where(fifa.Name.startswith("L")).where(fifa.Name.endswith("i")).show(5)

+-------------+---------------+
|         Name|           Club|
+-------------+---------------+
|     L. Messi|   FC Barcelona|
|   L. Bonucci|       Juventus|
| L. Fabiański|West Ham United|
|L. Pellegrini|           Roma|
| L. Pavoletti|       Cagliari|
+-------------+---------------+
only showing top 5 rows



### Slicing

In [21]:
## count rows first
fifa.count()

18207

In [22]:
## slice rows: keep 200 rows (without an orderBy, which 200 rows you get is not guaranteed)
df_slice = fifa.limit(200)
df_slice.count()

200

In [27]:
## slice columns --> create a list first
col_list = fifa.columns[0:10] ## slice first 10 cols
df_slice2 = fifa.select(col_list) ## select the slice

## show first 5 without truncation
df_slice2.show(5,False)

+---+------+-----------------+---+----------------------------------------------+-----------+-----------------------------------+-------+---------+-------------------+
|_c0|ID    |Name             |Age|Photo                                         |Nationality|Flag                               |Overall|Potential|Club               |
+---+------+-----------------+---+----------------------------------------------+-----------+-----------------------------------+-------+---------+-------------------+
|0  |158023|L. Messi         |31 |https://cdn.sofifa.org/players/4/19/158023.png|Argentina  |https://cdn.sofifa.org/flags/52.png|94     |94       |FC Barcelona       |
|1  |20801 |Cristiano Ronaldo|33 |https://cdn.sofifa.org/players/4/19/20801.png |Portugal   |https://cdn.sofifa.org/flags/38.png|94     |94       |Juventus           |
|2  |190871|Neymar Jr        |26 |https://cdn.sofifa.org/players/4/19/190871.png|Brazil     |https://cdn.sofifa.org/flags/54.png|92     |93       |Paris Saint-G

In [28]:
## verify num of cols
len(df_slice2.columns)

10

In [29]:
## create a small example df with an array column
df = spark.createDataFrame([([1,2,3],),([4,5],)],['x'])
df.show()

+---------+
|        x|
+---------+
|[1, 2, 3]|
|   [4, 5]|
+---------+



Summary
* Above we have arrays within a column. Let's slice these.

In [31]:
## slice(col, start, length): start is 1-based (unlike Python), so (2, 2) takes up to 2 elements beginning at the 2nd
df.select(slice(df.x,2,2).alias('Sliced Cols')).show()

+-----------+
|Sliced Cols|
+-----------+
|     [2, 3]|
|        [5]|
+-----------+
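To make the 1-based `(start, length)` convention concrete, here is a pure-Python model of the same slice. It is only an illustration of the indexing rule, not Spark's implementation: the helper name `spark_like_slice` is made up, and it only handles positive start positions.

```python
def spark_like_slice(values, start, length):
    """Model slice(col, start, length) on a Python list for a positive, 1-based start."""
    if start < 1:
        raise ValueError("this sketch only models positive start positions")
    begin = start - 1  # convert the 1-based start to Python's 0-based index
    return values[begin:begin + length]


print(spark_like_slice([1, 2, 3], 2, 2))  # [2, 3]
print(spark_like_slice([4, 5], 2, 2))     # [5]
```

The second call returns a single element because the array runs out before two elements can be taken, matching the `[5]` row in the output above.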



### Filter

In [33]:
## filtering
fifa.filter("Overall>50").limit(5).toPandas()

|   | _c0 | ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | ... | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 158023 | L. Messi | 31 | https://cdn.sofifa.org/players/4/19/158023.png | Argentina | https://cdn.sofifa.org/flags/52.png | 94 | 94 | FC Barcelona | ... | 96 | 33 | 28 | 26 | 6 | 11 | 15 | 14 | 8 | €226.5M |
| 1 | 1 | 20801 | Cristiano Ronaldo | 33 | https://cdn.sofifa.org/players/4/19/20801.png | Portugal | https://cdn.sofifa.org/flags/38.png | 94 | 94 | Juventus | ... | 95 | 28 | 31 | 23 | 7 | 11 | 15 | 14 | 11 | €127.1M |
| 2 | 2 | 190871 | Neymar Jr | 26 | https://cdn.sofifa.org/players/4/19/190871.png | Brazil | https://cdn.sofifa.org/flags/54.png | 92 | 93 | Paris Saint-Germain | ... | 94 | 27 | 24 | 33 | 9 | 9 | 15 | 15 | 11 | €228.1M |
| 3 | 3 | 193080 | De Gea | 27 | https://cdn.sofifa.org/players/4/19/193080.png | Spain | https://cdn.sofifa.org/flags/45.png | 91 | 93 | Manchester United | ... | 68 | 15 | 21 | 13 | 90 | 85 | 87 | 88 | 94 | €138.6M |
| 4 | 4 | 192985 | K. De Bruyne | 27 | https://cdn.sofifa.org/players/4/19/192985.png | Belgium | https://cdn.sofifa.org/flags/7.png | 91 | 92 | Manchester City | ... | 88 | 68 | 58 | 51 | 15 | 13 | 5 | 10 | 13 | €196.4M |


In [36]:
## filtering but limit num of columns
fifa.filter("Overall<50").select(['Name','Age']).limit(5).toPandas()

|   | Name | Age |
|---|---|---|
| 0 | D. Collins | 17 |
| 1 | J. Egan | 19 |
| 2 | Xie Xiaofan | 20 |
| 3 | B. Buckley | 17 |
| 4 | G. Figliuzzi | 17 |


### Collecting Results as Objects
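* To use a value in ordinary Python code, we first need to pull it out of the DataFrame with `.collect()`, which returns the rows to the driver as a Python list.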

In [37]:
## filter, keep a few columns, sort by Overall (highest first), then collect the rows as a Python list
result = fifa.filter("Overall>50").select(['Nationality','Name','Age','Overall']).orderBy(fifa['Overall'].desc()).collect()

In [38]:
## type
type(result[0])

Summary
* Each element of `result` is a PySpark `Row`, which can be indexed by position (as below) or by field name.

In [40]:
## collect() returned a Python list of Row objects, so result[row][field] gives a plain value
## indexing the DataFrame itself (e.g. fifa[0][1]) returns a Column expression, not a value, so it would not print the name
print("Best Player Over 50 is: ", result[0][1])

Best Player Over 50 is:  L. Messi


In [41]:
## worst player
print("Worst Player Over 50 is: ", result[-1][1])

Worst Player Over 50 is:  C. Addai
