# **Search and Filter DataFrames in PySpark**

Once we have created our Spark session, read in the data we want to work with, and done some basic validation, the next step is to start exploring the DataFrame. PySpark offers several ways to do this. This lecture covers the ones listed below, and the next several lectures go deeper.

### **Agenda:**
- Introduce the PySpark SQL functions library;
- Select method;
- Order By;
- Like Operator (for searching a string);
- Substring search;
- Is In Operator;
- Starts with, Ends with;
- Slicing;
- Filtering;
- Collecting Results as Objects.
  
Let's get started!

In [44]:
import pyspark
from pyspark.sql import SparkSession

In [45]:
spark = SparkSession.builder.appName('DatasetSearchAndFilter').getOrCreate()
spark

In [46]:
path = 'data/'
fifa = spark.read.csv(path + 'fifa19.csv', header=True)
# printSchema() prints the schema itself and returns None, so no print() is needed
fifa.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- ID: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Photo: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Flag: string (nullable = true)
 |-- Overall: string (nullable = true)
 |-- Potential: string (nullable = true)
 |-- Club: string (nullable = true)
 |-- Club Logo: string (nullable = true)
 |-- Value: string (nullable = true)
 |-- Wage: string (nullable = true)
 |-- Special: string (nullable = true)
 |-- Preferred Foot: string (nullable = true)
 |-- International Reputation: string (nullable = true)
 |-- Weak Foot: string (nullable = true)
 |-- Skill Moves: string (nullable = true)
 |-- Work Rate: string (nullable = true)
 |-- Body Type: string (nullable = true)
 |-- Real Face: string (nullable = true)
 |-- Position: string (nullable = true)
 |-- Jersey Number: string (nullable = true)
 |-- Joined: string (nullable = true)
 |-- Loaned From: string (nullable = t

In [47]:
fifa.limit(5).toPandas()

Unnamed: 0,_c0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94,27,24,33,9,9,15,15,11,€228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68,15,21,13,90,85,87,88,94,€138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,...,88,68,58,51,15,13,5,10,13,€196.4M


In [48]:
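# Wildcard import for brevity; note that it shadows Python built-ins such as slice, sum, min and max
from pyspark.sql.functions import *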

fifa.select(['Nationality', 'Name', 'Age', 'Photo']).show(5, truncate=False)

+-----------+-----------------+---+----------------------------------------------+
|Nationality|Name             |Age|Photo                                         |
+-----------+-----------------+---+----------------------------------------------+
|Argentina  |L. Messi         |31 |https://cdn.sofifa.org/players/4/19/158023.png|
|Portugal   |Cristiano Ronaldo|33 |https://cdn.sofifa.org/players/4/19/20801.png |
|Brazil     |Neymar Jr        |26 |https://cdn.sofifa.org/players/4/19/190871.png|
|Spain      |De Gea           |27 |https://cdn.sofifa.org/players/4/19/193080.png|
|Belgium    |K. De Bruyne     |27 |https://cdn.sofifa.org/players/4/19/192985.png|
+-----------+-----------------+---+----------------------------------------------+
only showing top 5 rows



In [49]:
fifa.select(['Name', 'Age']).orderBy(fifa['Age']).show(10)

+---------------+---+
|           Name|Age|
+---------------+---+
|       A. Doğan| 16|
|       B. Waine| 16|
|     C. Bassett| 16|
|P. Samiec-Talar| 16|
|    L. D'Arrigo| 16|
|      B. Nygren| 16|
|       K. Broda| 16|
|       R. Gómez| 16|
|      J. Olstad| 16|
|       E. Ceide| 16|
+---------------+---+
only showing top 10 rows



In [50]:
fifa.select(['Name', 'Age']).orderBy(fifa['Age'].desc()).show(10)

+-------------+---+
|         Name|Age|
+-------------+---+
|     O. Pérez| 45|
|    T. Warner| 44|
|K. Pilkington| 44|
|  S. Narazaki| 42|
|    J. Villar| 41|
|     B. Nivet| 41|
|     M. Tyler| 41|
| H. Sulaimani| 41|
|     C. Muñoz| 41|
|  S. Phillips| 40|
+-------------+---+
only showing top 10 rows
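
One caveat on the two sorts above: the schema printed earlier lists `Age` as `string`, so `orderBy` compares the values as text. The results look numeric only because every age in this dataset has two digits. The plain-Python sketch below (no Spark involved, and the sample values are made up for illustration) shows how text ordering and numeric ordering diverge once the digit counts differ.

```python
# Illustrative values only (not taken from the FIFA dataset).
ages_as_text = ["9", "16", "45", "100"]

# Text comparison goes character by character, so "100" sorts before "16".
text_order = sorted(ages_as_text)

# Converting to int first gives the ordering most readers expect.
numeric_order = sorted(ages_as_text, key=int)

print(text_order)     # ['100', '16', '45', '9']
print(numeric_order)  # ['9', '16', '45', '100']
```

In Spark, the usual ways to avoid this are reading the file with `inferSchema=True` or casting the column before sorting, for example `fifa['Age'].cast('int')`.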



In [51]:
fifa.select(['Name', 'Club']).where(fifa.Club.like('%Barcelona%')).show()

+---------------+------------+
|           Name|        Club|
+---------------+------------+
|       L. Messi|FC Barcelona|
|      L. Suárez|FC Barcelona|
|  M. ter Stegen|FC Barcelona|
|Sergio Busquets|FC Barcelona|
|       Coutinho|FC Barcelona|
|      S. Umtiti|FC Barcelona|
|     Jordi Alba|FC Barcelona|
|     I. Rakitić|FC Barcelona|
|          Piqué|FC Barcelona|
|       A. Vidal|FC Barcelona|
|     O. Dembélé|FC Barcelona|
|  Sergi Roberto|FC Barcelona|
|         Arthur|FC Barcelona|
|         Malcom|FC Barcelona|
|     C. Lenglet|FC Barcelona|
|        Rafinha|FC Barcelona|
|   J. Cillessen|FC Barcelona|
|  Nélson Semedo|FC Barcelona|
|   Denis Suárez|FC Barcelona|
|          Munir|FC Barcelona|
+---------------+------------+
only showing top 20 rows
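
The pattern passed to `like` follows SQL `LIKE` rules: `%` matches any run of characters (including none) and `_` matches exactly one character. As a rough plain-Python illustration of those rules, the hypothetical helper `sql_like` below (not part of PySpark) translates a pattern into a regular expression. It is a sketch only and ignores escape characters, which real `LIKE` supports.

```python
import re

def sql_like(value: str, pattern: str) -> bool:
    """Approximate SQL LIKE matching: % = any run of characters, _ = one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # LIKE must match the whole value, hence fullmatch; DOTALL lets % span newlines.
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None

print(sql_like("FC Barcelona", "%Barcelona%"))  # True
print(sql_like("FC Barcelona", "Barcelona%"))   # False: nothing may precede "Barcelona"
print(sql_like("Juventus", "J_ventus"))         # True
```

The second call shows why the notebook wraps the search term in `%` on both sides: without the leading `%`, "FC Barcelona" would not match.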



In [52]:
fifa.select('Photo', fifa.Photo.substr(-4,4).alias('Photo format')).show(5, False)

+----------------------------------------------+------------+
|Photo                                         |Photo format|
+----------------------------------------------+------------+
|https://cdn.sofifa.org/players/4/19/158023.png|.png        |
|https://cdn.sofifa.org/players/4/19/20801.png |.png        |
|https://cdn.sofifa.org/players/4/19/190871.png|.png        |
|https://cdn.sofifa.org/players/4/19/193080.png|.png        |
|https://cdn.sofifa.org/players/4/19/192985.png|.png        |
+----------------------------------------------+------------+
only showing top 5 rows
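
In `substr(-4, 4)` the negative start position counts back from the end of the string, and the second argument is a length. The plain-Python sketch below mirrors that for negative start values with a hypothetical helper `substr_from_end` (not a PySpark function), using the first photo URL from the output above. It also shows `os.path.splitext` as one possible alternative when extensions are not always three letters long.

```python
import os

def substr_from_end(text: str, start: int, length: int) -> str:
    """Mimic substr(start, length) for negative start values only."""
    if start >= 0:
        raise ValueError("this sketch only handles negative start positions")
    end = start + length
    # If the window reaches the end of the string, slice to the end.
    return text[start:end] if end < 0 else text[start:]

url = "https://cdn.sofifa.org/players/4/19/158023.png"

print(substr_from_end(url, -4, 4))  # .png
print(substr_from_end(url, -4, 2))  # .p

# splitext does not assume a fixed extension length.
print(os.path.splitext(url)[1])     # .png
```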



In [53]:
fifa[fifa.Club.isin('FC Barcelona', 'Juventus')].limit(5).toPandas()

Unnamed: 0,_c0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,€127.1M
2,7,176580,L. Suárez,31,https://cdn.sofifa.org/players/4/19/176580.png,Uruguay,https://cdn.sofifa.org/flags/60.png,91,91,FC Barcelona,...,85,62,45,38,27,25,31,33,37,€164M
3,15,211110,P. Dybala,24,https://cdn.sofifa.org/players/4/19/211110.png,Argentina,https://cdn.sofifa.org/flags/52.png,89,94,Juventus,...,84,23,20,20,5,4,4,5,8,€153.5M
4,18,192448,M. ter Stegen,26,https://cdn.sofifa.org/players/4/19/192448.png,Germany,https://cdn.sofifa.org/flags/21.png,89,92,FC Barcelona,...,69,25,13,10,87,85,88,85,90,€123.3M


In [54]:
fifa.select(['Name', 'Club', 'Age']).where(fifa.Name.startswith('L')).where(fifa.Name.endswith('i')).show(5)

+-------------+---------------+---+
|         Name|           Club|Age|
+-------------+---------------+---+
|     L. Messi|   FC Barcelona| 31|
|   L. Bonucci|       Juventus| 31|
| L. Fabiański|West Ham United| 33|
|L. Pellegrini|           Roma| 22|
| L. Pavoletti|       Cagliari| 29|
+-------------+---------------+---+
only showing top 5 rows



In [55]:
fifa.count()

18207

In [57]:
df = fifa.limit(100)
df.count()

100

In [59]:
columns = fifa.columns[:5]
df3 = fifa.select(columns)
df3.show()

+---+------+-----------------+---+--------------------+
|_c0|    ID|             Name|Age|               Photo|
+---+------+-----------------+---+--------------------+
|  0|158023|         L. Messi| 31|https://cdn.sofif...|
|  1| 20801|Cristiano Ronaldo| 33|https://cdn.sofif...|
|  2|190871|        Neymar Jr| 26|https://cdn.sofif...|
|  3|193080|           De Gea| 27|https://cdn.sofif...|
|  4|192985|     K. De Bruyne| 27|https://cdn.sofif...|
|  5|183277|        E. Hazard| 27|https://cdn.sofif...|
|  6|177003|        L. Modrić| 32|https://cdn.sofif...|
|  7|176580|        L. Suárez| 31|https://cdn.sofif...|
|  8|155862|     Sergio Ramos| 32|https://cdn.sofif...|
|  9|200389|         J. Oblak| 25|https://cdn.sofif...|
| 10|188545|   R. Lewandowski| 29|https://cdn.sofif...|
| 11|182521|         T. Kroos| 28|https://cdn.sofif...|
| 12|182493|         D. Godín| 32|https://cdn.sofif...|
| 13|168542|      David Silva| 32|https://cdn.sofif...|
| 14|215914|         N. Kanté| 27|https://cdn.so

In [60]:
len(df3.columns)

5

In [69]:
df_new = spark.createDataFrame([([1, 2, 3],),([4,5],)],['x'])
df_new.show()

+---------+
|        x|
+---------+
|[1, 2, 3]|
|   [4, 5]|
+---------+



In [73]:
df_new.select(slice(df_new.x,2,2)).show()
# df_new.select(slice(df_new.x,2,1)).show()  # start=2, length=1: returns only the second element of each array

+--------------+
|slice(x, 2, 2)|
+--------------+
|        [2, 3]|
|           [5]|
+--------------+
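
`slice(x, start, length)` counts positions from 1, not 0, and its third argument is a length rather than an end index. The plain-Python sketch below mirrors that behaviour for positive `start` values using a hypothetical helper `slice_one_based` (not a PySpark function), applied to the same two arrays as above.

```python
def slice_one_based(values, start, length):
    """Mimic slice(x, start, length) for start >= 1: 1-based start, fixed length."""
    if start < 1:
        raise ValueError("this sketch only handles start >= 1")
    begin = start - 1  # convert to Python's 0-based index
    return values[begin:begin + length]

rows = [[1, 2, 3], [4, 5]]

print([slice_one_based(r, 2, 2) for r in rows])  # [[2, 3], [5]]
print([slice_one_based(r, 2, 1) for r in rows])  # [[2], [5]]
```

The second row yields `[5]` for a requested length of 2 because the array simply runs out of elements; no padding is added.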



In [74]:
fifa.filter('Overall>50').limit(5).toPandas()

Unnamed: 0,_c0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94,27,24,33,9,9,15,15,11,€228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68,15,21,13,90,85,87,88,94,€138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,...,88,68,58,51,15,13,5,10,13,€196.4M


In [79]:
fifa.filter('Overall>70').select(['Name', 'Age']).limit(10).toPandas()

Unnamed: 0,Name,Age
0,L. Messi,31
1,Cristiano Ronaldo,33
2,Neymar Jr,26
3,De Gea,27
4,K. De Bruyne,27
5,E. Hazard,27
6,L. Modrić,32
7,L. Suárez,31
8,Sergio Ramos,32
9,J. Oblak,25


In [81]:
result = fifa.filter('Overall>50').select(['Nationality', 'Name', 'Age', 'Overall']) \
    .orderBy(fifa['Overall'].desc()).collect()

result[:5]

[Row(Nationality='Argentina', Name='L. Messi', Age='31', Overall='94'),
 Row(Nationality='Portugal', Name='Cristiano Ronaldo', Age='33', Overall='94'),
 Row(Nationality='Brazil', Name='Neymar Jr', Age='26', Overall='92'),
 Row(Nationality='Spain', Name='De Gea', Age='27', Overall='91'),
 Row(Nationality='Belgium', Name='K. De Bruyne', Age='27', Overall='91')]

In [83]:
type(result[0])

pyspark.sql.types.Row

In [86]:
print(f"Best player over 50: {result[0][1]}")
print(f"Worst player over 50: {result[-1][1]}")

Best player over 50: L. Messi
Worst player over 50: C. Addai
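
`collect()` returns a plain Python list of `Row` objects, which behave like named tuples: a field can be read by position, as in `result[0][1]`, or by name as an attribute. Because the columns were read as strings, numeric fields come back as text and need converting before any arithmetic. The sketch below uses `collections.namedtuple` as a stand-in for `Row` (an assumption made so the example runs without Spark), filled with the first three rows shown above.

```python
import statistics
from collections import namedtuple

# Stand-in for pyspark.sql.Row, so this runs without a Spark session.
PlayerRow = namedtuple("PlayerRow", ["Nationality", "Name", "Age", "Overall"])

result = [
    PlayerRow("Argentina", "L. Messi", "31", "94"),
    PlayerRow("Portugal", "Cristiano Ronaldo", "33", "94"),
    PlayerRow("Brazil", "Neymar Jr", "26", "92"),
]

first = result[0]
print(first[1])    # access by position
print(first.Name)  # access by field name, easier to read

# Values are strings, so convert before doing arithmetic.
mean_age = statistics.mean(int(r.Age) for r in result)
print(mean_age)
```

Reading fields by name keeps the code correct even if the column order in `select` changes later.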
