## Search and Filter DataFrames
Creating a SparkSession is the first thing to do when working with PySpark. The resulting spark variable is the entry point for reading data, and it also provides access to the Spark web UI for monitoring jobs.

In [1]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SearchAndFilter").getOrCreate()

In [2]:
path = "Datasets/"

### Read the CSV and let Spark infer the schema of the inputs

In [3]:
fifa = spark.read.csv(path+'fifa19.csv',header=True,inferSchema=True)

In [4]:
fifa.limit(5).toPandas()

Unnamed: 0,_c0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94,27,24,33,9,9,15,15,11,€228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68,15,21,13,90,85,87,88,94,€138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,...,88,68,58,51,15,13,5,10,13,€196.4M


#### Checking out the inferred schema for the CSV

In [5]:
fifa.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Photo: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Flag: string (nullable = true)
 |-- Overall: integer (nullable = true)
 |-- Potential: integer (nullable = true)
 |-- Club: string (nullable = true)
 |-- Club Logo: string (nullable = true)
 |-- Value: string (nullable = true)
 |-- Wage: string (nullable = true)
 |-- Special: integer (nullable = true)
 |-- Preferred Foot: string (nullable = true)
 |-- International Reputation: integer (nullable = true)
 |-- Weak Foot: integer (nullable = true)
 |-- Skill Moves: integer (nullable = true)
 |-- Work Rate: string (nullable = true)
 |-- Body Type: string (nullable = true)
 |-- Real Face: string (nullable = true)
 |-- Position: string (nullable = true)
 |-- Jersey Number: integer (nullable = true)
 |-- Joined: string (nullable = true)
 |-- Loaned From: string (nu

### We are going to use some functions defined in pyspark.sql.functions as we search and filter the data

In [6]:
from pyspark.sql.functions import *

In [7]:
fifa.select(["Nationality","Name","Age","Photo"]).show(5,False)

+-----------+-----------------+---+----------------------------------------------+
|Nationality|Name             |Age|Photo                                         |
+-----------+-----------------+---+----------------------------------------------+
|Argentina  |L. Messi         |31 |https://cdn.sofifa.org/players/4/19/158023.png|
|Portugal   |Cristiano Ronaldo|33 |https://cdn.sofifa.org/players/4/19/20801.png |
|Brazil     |Neymar Jr        |26 |https://cdn.sofifa.org/players/4/19/190871.png|
|Spain      |De Gea           |27 |https://cdn.sofifa.org/players/4/19/193080.png|
|Belgium    |K. De Bruyne     |27 |https://cdn.sofifa.org/players/4/19/192985.png|
+-----------+-----------------+---+----------------------------------------------+
only showing top 5 rows



When the players are ordered by age, the first 5 rows are all 16 years old

In [8]:
fifa.select(["Name","Age"]).orderBy(fifa["Age"]).show(5)

+------------+---+
|        Name|Age|
+------------+---+
|   B. Nygren| 16|
|H. Andersson| 16|
|    A. Doğan| 16|
|  C. Bassett| 16|
|    B. Mumba| 16|
+------------+---+
only showing top 5 rows



Let's check out the same in descending order

In [9]:
fifa.select(["Name","Age"]).orderBy(fifa['Age'].desc()).show(5)

+-------------+---+
|         Name|Age|
+-------------+---+
|     O. Pérez| 45|
|K. Pilkington| 44|
|    T. Warner| 44|
|  S. Narazaki| 42|
|    J. Villar| 41|
+-------------+---+
only showing top 5 rows



#### I want all players whose club name contains Barcelona
This looks very similar to the query I would write in SQL to do the same. <br/>
SELECT Name,<br/>
Club <br/>
FROM fifa <br/>
WHERE Club LIKE '%Barcelona%';

In [10]:
fifa.select(["Name","Club"]).where(fifa.Club.like("%Barcelona%")).show(5,False)

+---------------+------------+
|Name           |Club        |
+---------------+------------+
|L. Messi       |FC Barcelona|
|L. Suárez      |FC Barcelona|
|M. ter Stegen  |FC Barcelona|
|Sergio Busquets|FC Barcelona|
|Coutinho       |FC Barcelona|
+---------------+------------+
only showing top 5 rows
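
To make the wildcard semantics concrete, here is a plain-Python sketch (not Spark code) of how a LIKE pattern can be read as a regular expression: % matches any run of characters and _ matches exactly one. The helper names like_to_regex and sql_like are made up for illustration, escape characters are ignored, and "Barcelona SC" and "Real Madrid" are sample values added only to show a second match and a non-match.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regular expression.

    Simplified model: '%' -> any run of characters, '_' -> one character.
    Escape sequences are not handled.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def sql_like(value, pattern):
    """Return True when the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None


# Sample values for illustration only.
clubs = ["FC Barcelona", "Juventus", "Barcelona SC", "Real Madrid"]
matches = [club for club in clubs if sql_like(club, "%Barcelona%")]
print(matches)
```

Because the pattern has a % on both sides, "Barcelona" may appear anywhere in the club name. The match is case-sensitive, so "barcelona" in lower case would not be selected.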



#### I want to extract the photo extensions from the Photo column that consists of URLs
To do this I use the substr method of the column: we start at the 4th character from the end and take 4 characters from there. We also assign the new column name as "extension".

In [11]:
fifa.select("Photo",fifa.Photo.substr(-4,4).alias("extension")).show(5,False)

+----------------------------------------------+---------+
|Photo                                         |extension|
+----------------------------------------------+---------+
|https://cdn.sofifa.org/players/4/19/158023.png|.png     |
|https://cdn.sofifa.org/players/4/19/20801.png |.png     |
|https://cdn.sofifa.org/players/4/19/190871.png|.png     |
|https://cdn.sofifa.org/players/4/19/193080.png|.png     |
|https://cdn.sofifa.org/players/4/19/192985.png|.png     |
+----------------------------------------------+---------+
only showing top 5 rows
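
The position arithmetic behind substr(-4,4) can be sketched in plain Python. This is a model for illustration, not Spark code: sql_substr is a made-up helper, and how it treats positions that fall outside the string is an assumption of the sketch rather than something shown in this notebook.

```python
def sql_substr(text, pos, length):
    """Plain-Python model of SQL-style substr(pos, length).

    pos > 0 counts from the start (1-based); pos < 0 counts from the end.
    """
    if pos > 0:
        start = pos - 1
    elif pos < 0:
        start = len(text) + pos
    else:
        start = 0
    end = start + length
    return text[max(start, 0):max(end, 0)]


url = "https://cdn.sofifa.org/players/4/19/158023.png"
extension = sql_substr(url, -4, 4)
print(extension)

# A fixed width of 4 assumes a three-letter extension; a hypothetical
# "photo.jpeg" file name would lose its dot.
print(sql_substr("photo.jpeg", -4, 4))
```

The fixed width works here because every URL shown ends in .png. If extensions of other lengths were present, splitting on the last dot would be a safer approach.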



#### I want to filter data such that the Club name is either FC Barcelona or Juventus
If I had to do this with an SQL query, it would look like <br/>
SELECT * <br/>
FROM fifa <br/>
WHERE Club IN ('FC Barcelona', 'Juventus'); <br/>

In [12]:
fifa[fifa.Club.isin("FC Barcelona","Juventus")].limit(4).toPandas()

Unnamed: 0,_c0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,€127.1M
2,7,176580,L. Suárez,31,https://cdn.sofifa.org/players/4/19/176580.png,Uruguay,https://cdn.sofifa.org/flags/60.png,91,91,FC Barcelona,...,85,62,45,38,27,25,31,33,37,€164M
3,15,211110,P. Dybala,24,https://cdn.sofifa.org/players/4/19/211110.png,Argentina,https://cdn.sofifa.org/flags/52.png,89,94,Juventus,...,84,23,20,20,5,4,4,5,8,€153.5M
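
The isin check is an exact, case-sensitive membership test, which is why the values must be spelled exactly as they appear in the data. Here is a plain-Python sketch of that behaviour (not Spark code). "FC BARCELONA" and "Real Madrid" are sample values added for illustration.

```python
# Sample values for illustration; only the first two spellings appear in the output above.
clubs = ["FC Barcelona", "Juventus", "FC BARCELONA", "Real Madrid"]
wanted = {"FC Barcelona", "Juventus"}

# Exact membership: the upper-case spelling is not selected.
selected = [club for club in clubs if club in wanted]
print(selected)

# Case-insensitive variant: normalize both sides before comparing.
wanted_lower = {name.lower() for name in wanted}
selected_ci = [club for club in clubs if club.lower() in wanted_lower]
print(selected_ci)
```

In Spark the same normalization idea would apply to the column before calling isin, for example by lower-casing it first.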


#### I want to select a subset of data in which the player name starts with L and ends with i
If I had to do this in SQL it would look something like <br/>
SELECT Name, <br/>
Club <br/>
FROM fifa <br/>
WHERE Name LIKE 'L%i'; <br/>

In this example we use the startswith and endswith column methods instead. Chaining two where calls keeps only the rows that satisfy both conditions. Do note that a LIKE pattern with % wildcards can be a slower operation than a plain prefix or suffix check, depending on how the engine optimizes it.

In [13]:
fifa.select("Name","Club").where(fifa.Name.startswith("L")).where(fifa.Name.endswith("i")).show()

+---------------+--------------------+
|           Name|                Club|
+---------------+--------------------+
|       L. Messi|        FC Barcelona|
|     L. Bonucci|            Juventus|
|   L. Fabiański|     West Ham United|
|  L. Pellegrini|                Roma|
|   L. Pavoletti|            Cagliari|
|    L. Podolski|         Vissel Kobe|
|     L. Tonelli|           Sampdoria|
|  L. Rossettini|       Chievo Verona|
|       L. Zuffi|       FC Basel 1893|
|   L. Antonelli|              Empoli|
|   L. Skorupski|             Bologna|
|    L. Vangioni|           Monterrey|
|L. De Silvestri|              Torino|
|    L. Cigarini|            Cagliari|
|      L. Rigoni|               Parma|
|   L. Cavallini|           Puebla FC|
|   Léo Bonatini|Wolverhampton Wan...|
|  L. Mazzitelli|               Genoa|
|  L. Pisculichi|  Argentinos Juniors|
|      L. Sigali|         Racing Club|
+---------------+--------------------+
only showing top 20 rows
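
As a plain-Python sketch (not Spark code), the snippet below shows that two chained filters select the same rows as one combined condition, and that both agree with the LIKE 'L%i' reading. The names are taken from the outputs above.

```python
import re

# Names taken from the outputs above; the last two should not match.
names = ["L. Messi", "Léo Bonatini", "L. Suárez", "Neymar Jr"]

# Two chained filters, mirroring .where(...).where(...)
starts_with_l = [name for name in names if name.startswith("L")]
chained = [name for name in starts_with_l if name.endswith("i")]

# One combined condition
combined = [name for name in names if name.startswith("L") and name.endswith("i")]

# The LIKE 'L%i' reading: an L, then anything, then an i
like_style = [name for name in names if re.fullmatch(r"L.*i", name, re.DOTALL)]

print(chained)
```

In PySpark, the combined form would be written as a single where call with the two column conditions joined by the & operator.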



#### Viewing the number of rows

In [14]:
fifa.count()

18207

### Getting to some exciting functionality: slicing array columns

In [17]:
df = spark.createDataFrame([([1,2,3],),([4,5],)],['x'])
df.show()

+---------+
|        x|
+---------+
|[1, 2, 3]|
|   [4, 5]|
+---------+



#### Here we are slicing the column x starting at position 2 and taking up to two elements from there.
##### Note that array positions are 1-indexed

In [18]:
df.select(slice(df.x,2,2).alias("sliced_x")).show()

+--------+
|sliced_x|
+--------+
|  [2, 3]|
|     [5]|
+--------+
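
The 1-based arithmetic can be checked with a plain-Python sketch (not Spark code). The helper slice_1_based is made up for illustration and only models a positive start position.

```python
def slice_1_based(values, start, length):
    """Model of slice(x, start, length) for a positive, 1-based start."""
    if start < 1 or length < 0:
        raise ValueError("this sketch only models start >= 1 and length >= 0")
    begin = start - 1  # convert the 1-based position to a 0-based index
    return values[begin:begin + length]


rows = [[1, 2, 3], [4, 5]]
sliced = [slice_1_based(row, 2, 2) for row in rows]
print(sliced)  # [[2, 3], [5]]
```

The second row has only one element from position 2 onward, so the result is shorter than the requested length, which matches the output above.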



### Filter data through expressions

In [19]:
fifa.filter("Overall>50").limit(5).toPandas()

Unnamed: 0,_c0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94,27,24,33,9,9,15,15,11,€228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68,15,21,13,90,85,87,88,94,€138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,...,88,68,58,51,15,13,5,10,13,€196.4M


In [20]:
fifa.filter("Overall>50").select("Name","Age").limit(5).toPandas()

Unnamed: 0,Name,Age
0,L. Messi,31
1,Cristiano Ronaldo,33
2,Neymar Jr,26
3,De Gea,27
4,K. De Bruyne,27


#### So here we are combining some functionalities together and using collect and not show
The difference between show and collect: collect is a more expensive operation than show, because it brings every row of the result back to the driver, whereas show only fetches and prints the first few rows (20 by default). To explain with a simple analogy, let's say a friend has bought a new car and wants to show it to you. If you see the car over a Zoom call, that's show, and if you travel all the way to their place to see the car in person, that's collect. Of course, traveling all the way is much more expensive than a Zoom call.

In [21]:
result = fifa.filter("Overall>50").select("Nationality","Name","Age","Overall").orderBy(fifa["Overall"].desc()).collect()

With collect, the result is a list of Row objects and not a DataFrame

In [22]:
type(result[0])

pyspark.sql.types.Row

In [23]:
result

[Row(Nationality='Argentina', Name='L. Messi', Age=31, Overall=94),
 Row(Nationality='Portugal', Name='Cristiano Ronaldo', Age=33, Overall=94),
 Row(Nationality='Brazil', Name='Neymar Jr', Age=26, Overall=92),
 Row(Nationality='Spain', Name='De Gea', Age=27, Overall=91),
 Row(Nationality='Belgium', Name='K. De Bruyne', Age=27, Overall=91),
 Row(Nationality='Belgium', Name='E. Hazard', Age=27, Overall=91),
 Row(Nationality='Croatia', Name='L. Modrić', Age=32, Overall=91),
 Row(Nationality='Uruguay', Name='L. Suárez', Age=31, Overall=91),
 Row(Nationality='Spain', Name='Sergio Ramos', Age=32, Overall=91),
 Row(Nationality='Slovenia', Name='J. Oblak', Age=25, Overall=90),
 Row(Nationality='Poland', Name='R. Lewandowski', Age=29, Overall=90),
 Row(Nationality='Germany', Name='T. Kroos', Age=28, Overall=90),
 Row(Nationality='Uruguay', Name='D. Godín', Age=32, Overall=90),
 Row(Nationality='Spain', Name='David Silva', Age=32, Overall=90),
 Row(Nationality='France', Name='N. Kanté', Age=27, 

In [24]:
print("Best player over 50",result[0][1])

Best player over 50 L. Messi


In [25]:
print("Worst player over 50",result[-1][1])

Worst player over 50 C. Addai
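
A Row supports access by position, as in result[0][1], and also by field name, which tends to be easier to read. The sketch below uses a namedtuple as a stand-in for Row so that it runs without Spark. Player is a made-up name, and the three rows are copied from the collect output above.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row; the real objects come from collect().
Player = namedtuple("Player", ["Nationality", "Name", "Age", "Overall"])

# First three rows copied from the output above, already sorted by Overall descending.
result = [
    Player("Argentina", "L. Messi", 31, 94),
    Player("Portugal", "Cristiano Ronaldo", 33, 94),
    Player("Brazil", "Neymar Jr", 26, 92),
]

best = result[0]
print("By index:", best[1])
print("By name:", best.Name)

# Picking the top row without relying on the sort order of the list.
top = max(result, key=lambda row: row.Overall)
print("Top by Overall:", top.Name)
```

Note that result[0] and result[-1] give the best and worst players only because the query ordered the rows by Overall before collecting them.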
