# PySpark - Search and Filter DataFrames - more techniques
* Notebook by Adam Lang
* Date: 1/3/2024

# Overview
* In this notebook we will go over additional search and filter techniques in PySpark.

* The full list of functions available in the `pyspark.sql.functions` library is documented here:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions


## First set up a Spark Session


In [1]:
## setup spark session
import pyspark
from pyspark.sql import SparkSession

## init spark session
spark = SparkSession.builder.appName("SearchFilter2").getOrCreate()

## view spark session
spark

In [2]:
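## count the block managers (driver and executors) registered with this session
## note: this is an executor count, not a count of CPU cores
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print(f"The cores in this session are: {cores}'cores'.")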

The cores in this session are: 1'cores'.


## Read in the DataFrame for this Notebook

We will be continuing to use the `fifa19.csv` file for this notebook with additional search and filter functions.

In [5]:
## set path of data
path = "/content/drive/MyDrive/Colab Notebooks/PySpark Data Science/"

## load df
fifa_df = spark.read.csv(path+'fifa19.csv',
                         inferSchema=True,
                         header=True)

## show df
fifa_df.show(5,False)

[Output truncated: first 5 rows of the full-width table, beginning with the columns _c0, ID, Name, Age, Photo, ...]

## About this dataframe

The **fifa19.csv** dataset includes a list of all the FIFA 2019 players and their attributes listed below:

 - **General:** Age, Nationality, Overall, Potential, Club
 - **Metrics:** Value, Wage
 - **Player Descriptive:** Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight
 - **Position:** LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB
 - **Other:** Crossing, Finishing, HeadingAccuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause.

**Source:** https://www.kaggle.com/karangadiya/fifa19

Now let's use the `.toPandas()` method to view the first few rows of the dataset so we know what we are working with.

In [6]:
## show using .toPandas()
fifa_df.limit(5).toPandas()

Unnamed: 0,_c0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94,27,24,33,9,9,15,15,11,€228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68,15,21,13,90,85,87,88,94,€138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,...,88,68,58,51,15,13,5,10,13,€196.4M


Now let's print the schema of the dataset so we can see the data types of all the variables.

In [28]:
## printSchema() prints the schema itself and returns None, so no print() wrapper is needed
fifa_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Photo: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Flag: string (nullable = true)
 |-- Overall: integer (nullable = true)
 |-- Potential: integer (nullable = true)
 |-- Club: string (nullable = true)
 |-- Club Logo: string (nullable = true)
 |-- Value: string (nullable = true)
 |-- Wage: string (nullable = true)
 |-- Special: integer (nullable = true)
 |-- Preferred Foot: string (nullable = true)
 |-- International Reputation: integer (nullable = true)
 |-- Weak Foot: integer (nullable = true)
 |-- Skill Moves: integer (nullable = true)
 |-- Work Rate: string (nullable = true)
 |-- Body Type: string (nullable = true)
 |-- Real Face: string (nullable = true)
 |-- Position: string (nullable = true)
 |-- Jersey Number: integer (nullable = true)
 |-- Joined: string (nullable = true)
 |-- Loaned From: string (nu

## SQL Functions in PySpark Continued

### Import pyspark sql functions library


In [8]:
## import all pyspark sql functions
from pyspark.sql.functions import *

### 1. Select the Name, Position, and Release Clause of each player in the dataframe

In [29]:
## select name, position, and release clause
fifa_df.select(['Name','Position','Release Clause']).show(5,False)

+-----------------+--------+--------------+
|Name             |Position|Release Clause|
+-----------------+--------+--------------+
|L. Messi         |RF      |€226.5M       |
|Cristiano Ronaldo|ST      |€127.1M       |
|Neymar Jr        |LW      |€228.1M       |
|De Gea           |GK      |€138.6M       |
|K. De Bruyne     |RCM     |€196.4M       |
+-----------------+--------+--------------+
only showing top 5 rows



### 1.1 Display the Name and Position columns sorted by the players' names

In [30]:
## sort by players' names (ascending by default)
fifa_df.select(['Name','Position']).orderBy(fifa_df['Name']).show(5,False)

+-------------+--------+
|Name         |Position|
+-------------+--------+
|A. Abang     |ST      |
|A. Abdellaoui|LB      |
|A. Abdennour |CB      |
|A. Abdi      |CM      |
|A. Abdu Jaber|ST      |
+-------------+--------+
only showing top 5 rows



### 2. Select only the players who belong to a club beginning with FC

In [31]:
## only players in an "FC% club" --> orderby name asc
fifa_df.select(['Name','Position','Club']).where(fifa_df.Club.like("FC%")).orderBy(fifa_df['Name']).limit(5).toPandas()

Unnamed: 0,Name,Position,Club
0,A. Abdellaoui,LB,FC Sion
1,A. Absalem,LB,FC Groningen
2,A. Aguilar,CDM,FC Dallas
3,A. Ajeti,ST,FC Basel 1893
4,A. Baba,LB,FC Schalke 04
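
The `like("FC%")` filter uses SQL `LIKE` wildcards: `%` matches any run of characters (including none), `_` matches exactly one character, and the match is case-sensitive and must cover the whole string. As a rough illustration of those semantics outside Spark, here is a plain-Python sketch. The helper `sql_like` is hypothetical and not part of PySpark, it ignores escape characters, and only the first two club names come from the output above; the other two are made up for the example.

```python
import re

def sql_like(value: str, pattern: str) -> bool:
    """Approximate SQL LIKE: % = any run of characters, _ = exactly one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            # every other character must match literally
            parts.append(re.escape(ch))
    # fullmatch: the pattern has to cover the whole value, as LIKE does
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None

# first two names are from the query output; the last two are hypothetical
clubs = ["FC Sion", "FC Groningen", "Example United FC", "fc lowercase"]
fc_clubs = [c for c in clubs if sql_like(c, "FC%")]
print(fc_clubs)  # ['FC Sion', 'FC Groningen']
```

For a plain prefix test like this one, `fifa_df.Club.startswith("FC")` should express the same filter without wildcards.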


### 3. Who is the oldest player in the dataset and how old are they?

Display only the name and age of the oldest player.

In [32]:
## name and age of oldest player
fifa_df.select(['Name','Age']).orderBy(fifa_df['Age'].desc()).show(1)

+--------+---+
|    Name|Age|
+--------+---+
|O. Pérez| 45|
+--------+---+
only showing top 1 row
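
Note that `orderBy(...).show(1)` prints a single row, so if several players shared the maximum age only one of them would appear. A tie-safe approach is to find the maximum first and then keep every row that matches it. The plain-Python sketch below illustrates the idea on a few (Name, Age) pairs taken from the over-40 query output in this notebook; `oldest_players` is an illustrative helper, not a PySpark function.

```python
def oldest_players(rows):
    """Return every (name, age) pair whose age equals the maximum age."""
    max_age = max(age for _, age in rows)
    return [(name, age) for name, age in rows if age == max_age]

# (Name, Age) pairs taken from the over-40 query output in this notebook
players = [
    ("O. Pérez", 45),
    ("S. Narazaki", 42),
    ("T. Warner", 44),
    ("K. Pilkington", 44),
]

print(oldest_players(players))      # [('O. Pérez', 45)]
# Without O. Pérez, two players tie at 44 and both are returned
print(oldest_players(players[1:]))  # [('T. Warner', 44), ('K. Pilkington', 44)]
```

In PySpark the same two steps would be an aggregate for the maximum `Age` followed by a filter on that value.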



### 4. Select only the following players from the dataframe:

 - L. Messi
 - Cristiano Ronaldo

In [33]:
## select specific players --> let's try isin
fifa_df[fifa_df.Name.isin("L. Messi","Cristiano Ronaldo")].limit(5).toPandas()

Unnamed: 0,_c0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,€127.1M


### 5. Select the first character of the Release Clause variable, which indicates the currency used

In [23]:
## substring slice for currency char
fifa_df.select('Release Clause',fifa_df['Release Clause'].substr(1,1)).show(5,False)

+--------------+-------------------------------+
|Release Clause|substring(Release Clause, 1, 1)|
+--------------+-------------------------------+
|€226.5M       |€                              |
|€127.1M       |€                              |
|€228.1M       |€                              |
|€138.6M       |€                              |
|€196.4M       |€                              |
+--------------+-------------------------------+
only showing top 5 rows
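
`substr(1,1)` takes a 1-based start position and a length, unlike Python slicing, which is 0-based and takes an end index. The sketch below mimics that behaviour in plain Python to make the difference concrete. `spark_style_substr` is a hypothetical helper that only handles positive start positions; it is not how Spark implements `substr`.

```python
def spark_style_substr(text: str, start_pos: int, length: int) -> str:
    """Mimic a 1-based (start, length) substring for start_pos >= 1."""
    if start_pos < 1:
        raise ValueError("this sketch only supports start_pos >= 1")
    begin = start_pos - 1  # convert the 1-based position to a 0-based index
    return text[begin:begin + length]

release_clause = "€226.5M"  # value from the first row shown above

# position 1, length 1 -> the currency symbol
currency = spark_style_substr(release_clause, 1, 1)
# position 2 onward, leaving out the trailing unit letter
amount = spark_style_substr(release_clause, 2, len(release_clause) - 2)
# last character -> the unit suffix
suffix = release_clause[-1]

print(currency, amount, suffix)  # € 226.5 M
```

The same position arithmetic carries over to the `substr` call above: a start of 2 would skip the currency symbol.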



### 6. Select only the players who are over the age of 40

In [27]:
## players over age 40
fifa_df.filter("Age>40").select(['Name','Age']).toPandas()

Unnamed: 0,Name,Age
0,J. Villar,41
1,B. Nivet,41
2,O. Pérez,45
3,C. Muñoz,41
4,S. Narazaki,42
5,H. Sulaimani,41
6,M. Tyler,41
7,T. Warner,44
8,K. Pilkington,44
