# What Are DataFrames?

DataFrames generally refer to a data structure that is tabular in nature. Each row represents one observation, and each column holds one attribute of those observations. A row can mix several data types (heterogeneous), whereas a column holds values of a single data type (homogeneous). DataFrames usually carry some metadata in addition to the data, for example column and row names.

In short, DataFrames are 2-dimensional data structures, similar to a SQL table or a spreadsheet. Now let's move ahead with this PySpark DataFrame tutorial and understand why exactly we need PySpark DataFrames.

# Why Do We Need DataFrames?

## 1. Processing Structured and Semi-Structured Data

## 2. Slicing and Dicing

## 3. Data Sources

## 4. Support for Multiple Languages

# Features of DataFrames

1: DataFrames are distributed in nature, which makes them a fault-tolerant and highly available data structure.<br><br>
2: Lazy evaluation is an evaluation strategy that holds back the evaluation of an expression until its value is needed, so results that are never used are never computed. In Spark, transformations are lazy: they only record what should be done, and execution does not start until an action is triggered.<br><br>
3: DataFrames are immutable in nature: once a DataFrame is created, its state cannot be modified. We can, however, derive a new DataFrame from it by applying a transformation, just as with RDDs.

# PySpark DataFrames Example 1: FIFA World Cup Dataset

In [1]:
#Import Required Python & Spark Libraries

from pyspark.sql import SparkSession

In [2]:
spark = SparkSession \
    .builder \
    .appName("Create Spark Dataframes") \
    .getOrCreate()

In [7]:
#Read the CSV, treating the first line as the header and letting Spark infer column types
fifaDF = spark.read.csv('fifa-world-cup/WorldCupPlayers.csv', header=True, inferSchema=True)

In [13]:
fifaDF.show()

+-------+-------+-------------+-------------------+-------+------------+-----------------+--------+---------+
|RoundID|MatchID|Team Initials|         Coach Name|Line-up|Shirt Number|      Player Name|Position|    Event|
+-------+-------+-------------+-------------------+-------+------------+-----------------+--------+---------+
|    201|   1096|          FRA|CAUDRON Raoul (FRA)|      S|           0|      Alex THEPOT|      GK|     null|
|    201|   1096|          MEX|   LUQUE Juan (MEX)|      S|           0|  Oscar BONFIGLIO|      GK|     null|
|    201|   1096|          FRA|CAUDRON Raoul (FRA)|      S|           0| Marcel LANGILLER|    null|     G40'|
|    201|   1096|          MEX|   LUQUE Juan (MEX)|      S|           0|     Juan CARRENO|    null|     G70'|
|    201|   1096|          FRA|CAUDRON Raoul (FRA)|      S|           0|  Ernest LIBERATI|    null|     null|
|    201|   1096|          MEX|   LUQUE Juan (MEX)|      S|           0|     Rafael GARZA|       C|     null|
|    201| 

In [9]:
fifaDF.printSchema()

root
 |-- RoundID: integer (nullable = true)
 |-- MatchID: integer (nullable = true)
 |-- Team Initials: string (nullable = true)
 |-- Coach Name: string (nullable = true)
 |-- Line-up: string (nullable = true)
 |-- Shirt Number: integer (nullable = true)
 |-- Player Name: string (nullable = true)
 |-- Position: string (nullable = true)
 |-- Event: string (nullable = true)



In [11]:
#record count
fifaDF.count()

37784

In [16]:
#Display columns
fifaDF.columns

['RoundID',
 'MatchID',
 'Team Initials',
 'Coach Name',
 'Line-up',
 'Shirt Number',
 'Player Name',
 'Position',
 'Event']

In [17]:
#describe() returns a new DataFrame of summary statistics; append .show() to display the values
fifaDF.describe()

DataFrame[summary: string, RoundID: string, MatchID: string, Team Initials: string, Coach Name: string, Line-up: string, Shirt Number: string, Player Name: string, Position: string, Event: string]

In [22]:
#Selecting Distinct Multiple Columns
fifaDF.select('Player Name','Coach Name').distinct().show()

+--------------------+--------------------+
|         Player Name|          Coach Name|
+--------------------+--------------------+
|    Arturo FERNANDEZ| BRU Francisco (ESP)|
|Cayetano CARRERAS...|DURAND LAGUNA Jos...|
|  Ernesto MASCHERONI|SUPPICI Alberto (...|
|          Aziz FAHMY|   McREA James (SCO)|
|        Gyula POLGAR|    NADAS Odon (HUN)|
|  Ernesto ALBARRACIN|PASCUCCI Felipe (...|
| Armando CASTELLAZZI|POZZO Vittorio (ITA)|
|     Jaroslav BOUCEK|   PETRU Karel (TCH)|
|           Erwin NYC|  KALUZA Jozef (POL)|
|     Stanislaw BARAN|  KALUZA Jozef (POL)|
|     Fernando ROLDAN|BUCCIARDI Arturo ...|
|            Joe MACA|  JEFFREY Bill (SCO)|
|               INDIO|  MOREIRA Zeze (BRA)|
|      Rene DEREUDDRE|PIBAROT Pierre (FRA)|
|    Anton MALATINSKY|    CEJP Josef (TCH)|
|    Alberto MARIOTTI|LORENZO Juan Carl...|
|  Alfredo DI STEFANO|HERRERA Helenio (...|
|             FIDELIS| FEOLA Vicente (BRA)|
|     Stoyan YORDANOV|BOZHKOV Stefan (BUL)|
|      Wim RIJSBERGEN| MICHELS R

In [23]:
#Filtering Data
#MatchID is an integer column, so compare it against an integer literal
fifaDF.filter(fifaDF.MatchID == 1096).show()
fifaDF.filter(fifaDF.MatchID == 1096).count()

+-------+-------+-------------+-------------------+-------+------------+-----------------+--------+---------+
|RoundID|MatchID|Team Initials|         Coach Name|Line-up|Shirt Number|      Player Name|Position|    Event|
+-------+-------+-------------+-------------------+-------+------------+-----------------+--------+---------+
|    201|   1096|          FRA|CAUDRON Raoul (FRA)|      S|           0|      Alex THEPOT|      GK|     null|
|    201|   1096|          MEX|   LUQUE Juan (MEX)|      S|           0|  Oscar BONFIGLIO|      GK|     null|
|    201|   1096|          FRA|CAUDRON Raoul (FRA)|      S|           0| Marcel LANGILLER|    null|     G40'|
|    201|   1096|          MEX|   LUQUE Juan (MEX)|      S|           0|     Juan CARRENO|    null|     G70'|
|    201|   1096|          FRA|CAUDRON Raoul (FRA)|      S|           0|  Ernest LIBERATI|    null|     null|
|    201|   1096|          MEX|   LUQUE Juan (MEX)|      S|           0|     Rafael GARZA|       C|     null|
|    201| 

33

In [24]:
#Same filter, selecting the column with bracket syntax instead of attribute access
fifaDF.filter(fifaDF["MatchID"] == 1096).show()
fifaDF.filter(fifaDF["MatchID"] == 1096).count()

+-------+-------+-------------+-------------------+-------+------------+-----------------+--------+---------+
|RoundID|MatchID|Team Initials|         Coach Name|Line-up|Shirt Number|      Player Name|Position|    Event|
+-------+-------+-------------+-------------------+-------+------------+-----------------+--------+---------+
|    201|   1096|          FRA|CAUDRON Raoul (FRA)|      S|           0|      Alex THEPOT|      GK|     null|
|    201|   1096|          MEX|   LUQUE Juan (MEX)|      S|           0|  Oscar BONFIGLIO|      GK|     null|
|    201|   1096|          FRA|CAUDRON Raoul (FRA)|      S|           0| Marcel LANGILLER|    null|     G40'|
|    201|   1096|          MEX|   LUQUE Juan (MEX)|      S|           0|     Juan CARRENO|    null|     G70'|
|    201|   1096|          FRA|CAUDRON Raoul (FRA)|      S|           0|  Ernest LIBERATI|    null|     null|
|    201|   1096|          MEX|   LUQUE Juan (MEX)|      S|           0|     Rafael GARZA|       C|     null|
|    201| 

33

In [33]:
#Filtering Data (Multiple Parameters)
#We can filter our data based on multiple conditions (AND or OR)

fifaDF.filter((fifaDF.Position=='C') | (fifaDF.Event=="G40'")).show()

+-------+-------+-------------+--------------------+-------+------------+------------------+--------+-----+
|RoundID|MatchID|Team Initials|          Coach Name|Line-up|Shirt Number|       Player Name|Position|Event|
+-------+-------+-------------+--------------------+-------+------------+------------------+--------+-----+
|    201|   1096|          FRA| CAUDRON Raoul (FRA)|      S|           0|  Marcel LANGILLER|    null| G40'|
|    201|   1096|          MEX|    LUQUE Juan (MEX)|      S|           0|      Rafael GARZA|       C| null|
|    201|   1096|          FRA| CAUDRON Raoul (FRA)|      S|           0|   Alex VILLAPLANE|       C| null|
|    201|   1090|          USA|    MILLAR Bob (USA)|      S|           0|        Tom FLORIE|       C| G45'|
|    201|   1090|          BEL|GOETINCK Hector (...|      S|           0|     Pierre BRAINE|       C| null|
|    201|   1093|          BRA|DE CARVALHO Pinda...|      S|           0|         PREGUINHO|       C| G62'|
|    201|   1093|          Y

In [34]:
#Sorting Data (OrderBy)
fifaDF.orderBy(fifaDF.MatchID).show()

+-------+-------+-------------+--------------------+-------+------------+-------------------+--------+---------+
|RoundID|MatchID|Team Initials|          Coach Name|Line-up|Shirt Number|        Player Name|Position|    Event|
+-------+-------+-------------+--------------------+-------+------------+-------------------+--------+---------+
|    323|     25|          BRA|LAZARONI Sebastia...|      S|           1|           TAFFAREL|      GK|     null|
|    323|     25|          BRA|LAZARONI Sebastia...|      S|          21|       MAURO GALVAO|    null|Y50' O83'|
|    323|     25|          ARG|BILARDO Carlos (ARG)|      S|          12|   Sergio GOYCOCHEA|      GK|     Y87'|
|    323|     25|          BRA|LAZARONI Sebastia...|      S|           2|           JORGINHO|    null|     null|
|    323|     25|          ARG|BILARDO Carlos (ARG)|      S|           4|      Jose BASUALDO|    null|     null|
|    323|     25|          BRA|LAZARONI Sebastia...|      S|           3|      RICARDO GOMES|   

In [35]:
#Group By

fifaDF.groupby("Shirt Number")\
.count()\
.show()

+------------+-----+
|Shirt Number|count|
+------------+-----+
|          12| 1554|
|          22| 1544|
|           1| 1559|
|          13| 1550|
|           6| 1554|
|          16| 1554|
|           3| 1554|
|          20| 1551|
|           5| 1553|
|          19| 1554|
|          15| 1554|
|           9| 1554|
|          17| 1553|
|           4| 1554|
|           8| 1554|
|          23|  549|
|           7| 1554|
|          10| 1554|
|          21| 1546|
|          11| 1554|
+------------+-----+
only showing top 20 rows

