# RDD creation

#### [Introduction to Spark with Python, by Jose A. Dianes](https://github.com/jadianes/spark-py-notebooks)

Apache Spark works with datasets called RDDs (Resilient Distributed Datasets). They have several characteristics that set them apart from other data structures:
  + Immutable: once created, they cannot be modified.
  + Distributed: the data is split into partitions that are spread across the cluster.
  + Resilient: if a partition is lost, it is regenerated automatically.

Although RDDs are immutable, they can be transformed: a transformation creates a new RDD and is applied to the data of that new RDD, leaving the original untouched.

There are several ways to create RDDs:
  + From a file.
  + By distributing data from the driver.
  + By transforming an existing RDD into a new one.

## RDD life cycle

![RDD life cycle](https://keepcoding.io/wp-content/uploads/2022/06/image-39-1024x473.png)

# SparkContext

SparkContext is the entry point to Spark.

To run operations we need a context:
  + SparkContext, SQLContext...

Which one we need depends on the type of operation. Originally there was only SparkContext, used for operations on RDDs; later SparkSession was introduced, which covers RDDs, DataFrames and Datasets.

Internally, SparkSession wraps SparkContext, HiveContext, SQLContext...

SparkSession therefore works as a single entry point for all of these contexts.

In general, using SparkSession is the recommended approach, since it establishes a session with the master node.

## PySpark

**PySpark** is the **Python** programming interface for the **Apache Spark** distributed processing framework.

**Spark** is a high-performance distributed data processing engine used to process large volumes of data scalably and efficiently on clusters of machines.

**PySpark** is commonly used for data processing, machine learning, real-time data analysis, and for building applications that process large volumes of data.

In [None]:
!pip install pyspark

_**Using PySpark in Jupyter:** https://changhsinlee.com/install-pyspark-windows-jupyter/_

In [1]:
import pyspark
pyspark.__version__

'3.3.0'

**`PySpark` documentation**: https://spark.apache.org/docs/3.1.1/api/python/reference/index.html

In [2]:
#import findspark
#findspark.init()

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark_teoria").getOrCreate()
spark

In [3]:
# Number of executors (block managers) registered with the driver, not CPU cores; in local mode this is 1
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
cores

1

### Load a DataFrame

In [4]:
# Read a file with PySpark
titanic = spark.read.csv(path        = "../data/titanic.txt",
                         inferSchema = True, header = True)

In [5]:
titanic

DataFrame[PassengerId: int, Survived: int, Pclass: int, Name: string, Sex: string, Age: double, SibSp: int, Parch: int, Ticket: string, Fare: double, Cabin: string, Embarked: string]

In [6]:
titanic.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [7]:
titanic.show(5, truncate = True)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 5 rows

`first()` returns the first row of the DataFrame

In [8]:
titanic.first()

Row(PassengerId=1, Survived=0, Pclass=3, Name='Braund, Mr. Owen Harris', Sex='male', Age=22.0, SibSp=1, Parch=0, Ticket='A/5 21171', Fare=7.25, Cabin=None, Embarked='S')

`limit(4)` keeps 4 rows of the DataFrame, which we then convert to Pandas

In [9]:
titanic.limit(4).toPandas()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


### Data Validation 

`printSchema()` prints the schema of our DataFrame

In [10]:
type(titanic)

pyspark.sql.dataframe.DataFrame

In [11]:
titanic.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



`columns` returns the DataFrame's column names as a list that we can iterate over

In [12]:
titanic.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

`count()` returns the number of rows in our DataFrame

In [13]:
titanic.count()

891

`describe()`, as in Pandas, returns a statistical description of our data.

In [14]:
titanic.describe().toPandas().T.rename(columns={0:'count',1:'mean',2:'stddev',3:'min',4:'max'}).drop('summary', axis=0)

Unnamed: 0,count,mean,stddev,min,max
PassengerId,891,446.0,257.3538420152301,1,891
Survived,891,0.3838383838383838,0.4865924542648575,0,1
Pclass,891,2.308641975308642,0.8360712409770491,1,3
Name,891,,,"""Andersson, Mr. August Edvard (""""Wennerstrom"""")""","van Melkebeke, Mr. Philemon"
Sex,891,,,female,male
Age,714,29.69911764705882,14.526497332334037,0.42,80.0
SibSp,891,0.5230078563411896,1.1027434322934315,0,8
Parch,891,0.3815937149270482,0.8060572211299488,0,6
Ticket,891,260318.54916792735,471609.26868834975,110152,WE/P 5735
Fare,891,32.2042079685746,49.69342859718089,0.0,512.3292


`describe()` with Pandas

In [15]:
titanic.toPandas().describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
PassengerId,891.0,,,,446.0,257.353842,1.0,223.5,446.0,668.5,891.0
Survived,891.0,,,,0.383838,0.486592,0.0,0.0,0.0,1.0,1.0
Pclass,891.0,,,,2.308642,0.836071,1.0,2.0,3.0,3.0,3.0
Name,891.0,891.0,"Braund, Mr. Owen Harris",1.0,,,,,,,
Sex,891.0,2.0,male,577.0,,,,,,,
Age,714.0,,,,29.699118,14.526497,0.42,20.125,28.0,38.0,80.0
SibSp,891.0,,,,0.523008,1.102743,0.0,0.0,0.0,1.0,8.0
Parch,891.0,,,,0.381594,0.806057,0.0,0.0,0.0,0.0,6.0
Ticket,891.0,681.0,347082,7.0,,,,,,,
Fare,891.0,,,,32.204208,49.693429,0.0,7.9104,14.4542,31.0,512.3292


In [16]:
titanic.schema["Ticket"].dataType

StringType()

In [17]:
titanic.select("age", "fare").summary("count", "min", "max", "mean").show()

+-------+-----------------+----------------+
|summary|              age|            fare|
+-------+-----------------+----------------+
|  count|              714|             891|
|    min|             0.42|             0.0|
|    max|             80.0|        512.3292|
|   mean|29.69911764705882|32.2042079685746|
+-------+-----------------+----------------+



### Specify column dtypes

In [3]:
from pyspark.sql.types import *

In [19]:
# Without an explicit schema, PySpark reads every column as a string

people = spark.read.json(path = "../data/people.json")

# printSchema() prints the schema itself and returns None, so it is not wrapped in print()
people.printSchema()

people.limit(4).toPandas()

root
 |-- _corrupt_record: string (nullable = true)
 |-- city: string (nullable = true)
 |-- creditcard: string (nullable = true)
 |-- email: string (nullable = true)
 |-- mac: string (nullable = true)
 |-- name: string (nullable = true)
 |-- timestamp: string (nullable = true)


Unnamed: 0,_corrupt_record,city,creditcard,email,mac,name,timestamp
0,[,,,,,,
1,,Lake Gladysberg,1228-1221-1221-1431,katlyn@jenkinsmaggio.net,08:fd:0b:cd:77:f7,Keeley Bosco,2015-04-25 13:57:36 +0700
2,,,1228-1221-1221-1431,juvenal@johnston.name,90:4d:fa:42:63:a2,Rubye Jerde,2015-04-25 09:02:04 +0700
3,,,,,f9:0e:d3:40:cb:e9,Miss Darian Breitenberg,2015-04-25 13:16:03 +0700


In [20]:
# Change the dtype of "timestamp" to DateType()

data_schema = list((StructField("timestamp" ,   DateType(), True),
                    StructField("name"      , StringType(), True),
                    StructField("email"     , StringType(), True),
                    StructField("city"      , StringType(), True),
                    StructField("mac"       , StringType(), True),
                    StructField("creditcard", StringType(), True)))

final_struc = StructType(fields = data_schema)

In [21]:
final_struc

StructType([StructField('timestamp', DateType(), True), StructField('name', StringType(), True), StructField('email', StringType(), True), StructField('city', StringType(), True), StructField('mac', StringType(), True), StructField('creditcard', StringType(), True)])

In [22]:
# Read the file again, this time specifying the schema

people = spark.read.json(path   = "../data/people.json",
                         schema = final_struc)

In [23]:
people.printSchema()

root
 |-- timestamp: date (nullable = true)
 |-- name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- city: string (nullable = true)
 |-- mac: string (nullable = true)
 |-- creditcard: string (nullable = true)



In [24]:
people.limit(4).toPandas()

Unnamed: 0,timestamp,name,email,city,mac,creditcard
0,,,,,,
1,2015-04-25,Keeley Bosco,katlyn@jenkinsmaggio.net,Lake Gladysberg,08:fd:0b:cd:77:f7,1228-1221-1221-1431
2,2015-04-25,Rubye Jerde,juvenal@johnston.name,,90:4d:fa:42:63:a2,1228-1221-1221-1431
3,2015-04-25,Miss Darian Breitenberg,,,f9:0e:d3:40:cb:e9,


### Search and Filter

In [4]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

In [26]:
fifa = spark.read.csv(path        = "../data/fifa19.csv",
                      inferSchema = True, header = True)

fifa.limit(4).toPandas()

Unnamed: 0,_c0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94,27,24,33,9,9,15,15,11,€228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68,15,21,13,90,85,87,88,94,€138.6M


In [27]:
fifa.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Photo: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Flag: string (nullable = true)
 |-- Overall: integer (nullable = true)
 |-- Potential: integer (nullable = true)
 |-- Club: string (nullable = true)
 |-- Club Logo: string (nullable = true)
 |-- Value: string (nullable = true)
 |-- Wage: string (nullable = true)
 |-- Special: integer (nullable = true)
 |-- Preferred Foot: string (nullable = true)
 |-- International Reputation: integer (nullable = true)
 |-- Weak Foot: integer (nullable = true)
 |-- Skill Moves: integer (nullable = true)
 |-- Work Rate: string (nullable = true)
 |-- Body Type: string (nullable = true)
 |-- Real Face: string (nullable = true)
 |-- Position: string (nullable = true)
 |-- Jersey Number: integer (nullable = true)
 |-- Joined: string (nullable = true)
 |-- Loaned From: string (nullable = true)
 ...

In [28]:
# To select columns we use .select and pass a list of column names (the brackets are optional)

fifa.select(["Nationality", "Name", "Age", "Photo"]).show(5, truncate = False)

+-----------+-----------------+---+----------------------------------------------+
|Nationality|Name             |Age|Photo                                         |
+-----------+-----------------+---+----------------------------------------------+
|Argentina  |L. Messi         |31 |https://cdn.sofifa.org/players/4/19/158023.png|
|Portugal   |Cristiano Ronaldo|33 |https://cdn.sofifa.org/players/4/19/20801.png |
|Brazil     |Neymar Jr        |26 |https://cdn.sofifa.org/players/4/19/190871.png|
|Spain      |De Gea           |27 |https://cdn.sofifa.org/players/4/19/193080.png|
|Belgium    |K. De Bruyne     |27 |https://cdn.sofifa.org/players/4/19/192985.png|
+-----------+-----------------+---+----------------------------------------------+
only showing top 5 rows



In [29]:
fifa.select("Nationality", "Name", "Age", "Photo").show(5, truncate = False)

+-----------+-----------------+---+----------------------------------------------+
|Nationality|Name             |Age|Photo                                         |
+-----------+-----------------+---+----------------------------------------------+
|Argentina  |L. Messi         |31 |https://cdn.sofifa.org/players/4/19/158023.png|
|Portugal   |Cristiano Ronaldo|33 |https://cdn.sofifa.org/players/4/19/20801.png |
|Brazil     |Neymar Jr        |26 |https://cdn.sofifa.org/players/4/19/190871.png|
|Spain      |De Gea           |27 |https://cdn.sofifa.org/players/4/19/193080.png|
|Belgium    |K. De Bruyne     |27 |https://cdn.sofifa.org/players/4/19/192985.png|
+-----------+-----------------+---+----------------------------------------------+
only showing top 5 rows



In [31]:
(
    fifa
    .select(
        col('Nationality').alias('Nacionalidad'), 
        col('Name').alias('Nombre'),
        col('Age').alias('Edad'),
        col('Photo').alias('Fotografía')
    )
).show(5, truncate=False)

+------------+-----------------+----+----------------------------------------------+
|Nacionalidad|Nombre           |Edad|Fotografía                                    |
+------------+-----------------+----+----------------------------------------------+
|Argentina   |L. Messi         |31  |https://cdn.sofifa.org/players/4/19/158023.png|
|Portugal    |Cristiano Ronaldo|33  |https://cdn.sofifa.org/players/4/19/20801.png |
|Brazil      |Neymar Jr        |26  |https://cdn.sofifa.org/players/4/19/190871.png|
|Spain       |De Gea           |27  |https://cdn.sofifa.org/players/4/19/193080.png|
|Belgium     |K. De Bruyne     |27  |https://cdn.sofifa.org/players/4/19/192985.png|
+------------+-----------------+----+----------------------------------------------+
only showing top 5 rows



In [33]:
# orderBy sorts in ascending order by default

fifa.select(["Name", "Age"])\
    .orderBy(fifa["Age"]).show(5)

#fifa.select(["Name", "Age"])\
#    .orderBy(fifa["Age"].asc()).show(5)

+------------+---+
|        Name|Age|
+------------+---+
|   B. Nygren| 16|
|H. Andersson| 16|
|    A. Doğan| 16|
|  C. Bassett| 16|
|    B. Mumba| 16|
+------------+---+
only showing top 5 rows



In [42]:
(
    fifa
    .select(
        'Name', 
        'Age',
        'Club'
    )
    .orderBy(
        'age'
    )
).show(5)

+------------+---+---------------+
|        Name|Age|           Club|
+------------+---+---------------+
|   B. Nygren| 16|   IFK Göteborg|
|H. Andersson| 16|      Örebro SK|
|    A. Doğan| 16|    Kayserispor|
|  C. Bassett| 16|Colorado Rapids|
|    B. Mumba| 16|     Sunderland|
+------------+---+---------------+
only showing top 5 rows



Ascending

In [34]:
(
    fifa
    .select('Name', 'Age')
    .orderBy(col('age').asc())
).show(5)

+------------+---+
|        Name|Age|
+------------+---+
|   B. Nygren| 16|
|H. Andersson| 16|
|    A. Doğan| 16|
|  C. Bassett| 16|
|    B. Mumba| 16|
+------------+---+
only showing top 5 rows



In [35]:
# .desc()

fifa.select(["Name", "Age"])\
    .orderBy(fifa["Age"].desc()).show(5)

+-------------+---+
|         Name|Age|
+-------------+---+
|     O. Pérez| 45|
|K. Pilkington| 44|
|    T. Warner| 44|
|  S. Narazaki| 42|
|     C. Muñoz| 41|
+-------------+---+
only showing top 5 rows



In [36]:
(
    fifa
    .select('Name', 'Age')
    .orderBy(col('age').desc())
).show(5)

+-------------+---+
|         Name|Age|
+-------------+---+
|     O. Pérez| 45|
|K. Pilkington| 44|
|    T. Warner| 44|
|  S. Narazaki| 42|
|    J. Villar| 41|
+-------------+---+
only showing top 5 rows



In [43]:
# To filter by a text pattern we can combine .where with .like

fifa.select(["Name", "Club"])\
    .where(fifa.Club.like("%Barcelona%")).show(5, truncate = False)

+---------------+------------+
|Name           |Club        |
+---------------+------------+
|L. Messi       |FC Barcelona|
|L. Suárez      |FC Barcelona|
|M. ter Stegen  |FC Barcelona|
|Sergio Busquets|FC Barcelona|
|Coutinho       |FC Barcelona|
+---------------+------------+
only showing top 5 rows



Or with the `filter` function

In [45]:
(
    fifa
    .select(
        'Name',
        'Club'
    )
    .filter(
        col('Club').like('%Barcelona%')
    )
).show(5, truncate=False)

+---------------+------------+
|Name           |Club        |
+---------------+------------+
|L. Messi       |FC Barcelona|
|L. Suárez      |FC Barcelona|
|M. ter Stegen  |FC Barcelona|
|Sergio Busquets|FC Barcelona|
|Coutinho       |FC Barcelona|
+---------------+------------+
only showing top 5 rows



In [46]:
# We can use .substr() to slice a string column

fifa.select("Photo", fifa.Photo.substr(-4, 4)).show(5, truncate = False)

+----------------------------------------------+-----------------------+
|Photo                                         |substring(Photo, -4, 4)|
+----------------------------------------------+-----------------------+
|https://cdn.sofifa.org/players/4/19/158023.png|.png                   |
|https://cdn.sofifa.org/players/4/19/20801.png |.png                   |
|https://cdn.sofifa.org/players/4/19/190871.png|.png                   |
|https://cdn.sofifa.org/players/4/19/193080.png|.png                   |
|https://cdn.sofifa.org/players/4/19/192985.png|.png                   |
+----------------------------------------------+-----------------------+
only showing top 5 rows



In [47]:
# .isin works much like its Pandas counterpart

fifa[fifa.Club.isin("FC Barcelona", "Juventus")].limit(5).toPandas()

Unnamed: 0,_c0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,€127.1M
2,7,176580,L. Suárez,31,https://cdn.sofifa.org/players/4/19/176580.png,Uruguay,https://cdn.sofifa.org/flags/60.png,91,91,FC Barcelona,...,85,62,45,38,27,25,31,33,37,€164M
3,15,211110,P. Dybala,24,https://cdn.sofifa.org/players/4/19/211110.png,Argentina,https://cdn.sofifa.org/flags/52.png,89,94,Juventus,...,84,23,20,20,5,4,4,5,8,€153.5M
4,18,192448,M. ter Stegen,26,https://cdn.sofifa.org/players/4/19/192448.png,Germany,https://cdn.sofifa.org/flags/21.png,89,92,FC Barcelona,...,69,25,13,10,87,85,88,85,90,€123.3M


In [48]:
fifa[fifa.Club.isin("FC Barcelon", "Juventus")].limit(5).toPandas()

Unnamed: 0,_c0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,€127.1M
1,15,211110,P. Dybala,24,https://cdn.sofifa.org/players/4/19/211110.png,Argentina,https://cdn.sofifa.org/flags/52.png,89,94,Juventus,...,84,23,20,20,5,4,4,5,8,€153.5M
2,24,138956,G. Chiellini,33,https://cdn.sofifa.org/players/4/19/138956.png,Italy,https://cdn.sofifa.org/flags/27.png,89,89,Juventus,...,84,93,93,90,3,3,2,4,3,€44.6M
3,64,191043,Alex Sandro,27,https://cdn.sofifa.org/players/4/19/191043.png,Brazil,https://cdn.sofifa.org/flags/54.png,86,86,Juventus,...,82,81,84,84,7,7,9,12,5,€60.2M
4,65,190483,Douglas Costa,27,https://cdn.sofifa.org/players/4/19/190483.png,Brazil,https://cdn.sofifa.org/flags/54.png,86,86,Juventus,...,84,45,38,34,13,15,9,12,5,€76.7M


In [49]:
# .where(), .startswith() and .endswith()
# Note: chained .where calls are applied one after another, so both conditions must hold.

# fifa.select("Name", "Club").where(fifa.Name.startswith("L")).where(fifa.Name.endswith("i")).show(5)

fifa.select("Name", "Club")                \
    .where(fifa.Name.startswith("L"))      \
    .where(fifa.Name.endswith("i")).show(5)

+-------------+---------------+
|         Name|           Club|
+-------------+---------------+
|     L. Messi|   FC Barcelona|
|   L. Bonucci|       Juventus|
| L. Fabiański|West Ham United|
|L. Pellegrini|           Roma|
| L. Pavoletti|       Cagliari|
+-------------+---------------+
only showing top 5 rows



In [50]:
# Equivalent to df.shape[0] in Pandas

fifa.count()

18207

In [52]:
fifa.limit(100).count()

100

In [53]:
fifa.count()

18207

In [51]:
# .limit() selects the number of rows to keep

df3 = fifa.limit(100)
df3.count()

100

In [54]:
# Keep only the first 5 columns

col_list = fifa.columns[:5]
df3 = fifa.select(col_list)

In [55]:
df3

DataFrame[_c0: int, ID: int, Name: string, Age: int, Photo: string]

In [56]:
# New DataFrame
df3.show(5, False)

+---+------+-----------------+---+----------------------------------------------+
|_c0|ID    |Name             |Age|Photo                                         |
+---+------+-----------------+---+----------------------------------------------+
|0  |158023|L. Messi         |31 |https://cdn.sofifa.org/players/4/19/158023.png|
|1  |20801 |Cristiano Ronaldo|33 |https://cdn.sofifa.org/players/4/19/20801.png |
|2  |190871|Neymar Jr        |26 |https://cdn.sofifa.org/players/4/19/190871.png|
|3  |193080|De Gea           |27 |https://cdn.sofifa.org/players/4/19/193080.png|
|4  |192985|K. De Bruyne     |27 |https://cdn.sofifa.org/players/4/19/192985.png|
+---+------+-----------------+---+----------------------------------------------+
only showing top 5 rows



In [57]:
# .filter(condition)

fifa.filter("Overall > 50").limit(5).toPandas()

Unnamed: 0,_c0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94,27,24,33,9,9,15,15,11,€228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68,15,21,13,90,85,87,88,94,€138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,...,88,68,58,51,15,13,5,10,13,€196.4M


In [59]:
fifa.filter(col('Overall') > 50).limit(5).toPandas()

Unnamed: 0,_c0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94,27,24,33,9,9,15,15,11,€228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68,15,21,13,90,85,87,88,94,€138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,...,88,68,58,51,15,13,5,10,13,€196.4M


In [62]:
%%time
# We can combine .filter with .select

fifa.filter("Overall > 50").select(["Name", "Age"]).toPandas()

CPU times: total: 141 ms
Wall time: 457 ms


Unnamed: 0,Name,Age
0,L. Messi,31
1,Cristiano Ronaldo,33
2,Neymar Jr,26
3,De Gea,27
4,K. De Bruyne,27
...,...,...
18010,Zhu Zhengyu,23
18011,J. Ellis,17
18012,B. Galach,17
18013,W. Møller,20


In [63]:
%%time
# The order of .select and .filter does not affect the output

fifa.select(["Name", "Age"]).filter("Overall > 50").toPandas()

CPU times: total: 15.6 ms
Wall time: 187 ms


Unnamed: 0,Name,Age
0,L. Messi,31
1,Cristiano Ronaldo,33
2,Neymar Jr,26
3,De Gea,27
4,K. De Bruyne,27
...,...,...
18010,Zhu Zhengyu,23
18011,J. Ellis,17
18012,B. Galach,17
18013,W. Møller,20


In [64]:
# Several conditions combined with AND and OR

fifa.select(["Name", "Age", "Club"]).filter("Overall > 50 AND Age < 30 AND Club = 'FC Barcelona'").limit(5).toPandas()

Unnamed: 0,Name,Age,Club
0,M. ter Stegen,26,FC Barcelona
1,Sergio Busquets,29,FC Barcelona
2,Coutinho,26,FC Barcelona
3,S. Umtiti,24,FC Barcelona
4,Jordi Alba,29,FC Barcelona


In [68]:
(
    fifa
    .select(
        [
            "Name", 
            "Age", 
            "Club"
        ]
    )
    .filter(
        (col('Overall') > 50) & 
        (col('Age') < 30) & 
        (col('Club') == 'FC Barcelona')
    )
    .limit(5)
).toPandas()

Unnamed: 0,Name,Age,Club
0,M. ter Stegen,26,FC Barcelona
1,Sergio Busquets,29,FC Barcelona
2,Coutinho,26,FC Barcelona
3,S. Umtiti,24,FC Barcelona
4,Jordi Alba,29,FC Barcelona


In [69]:
fifa.select(["Name", "Age", "Club"]).filter("Club = 'Juventus' OR Club = 'FC Barcelona'").limit(5).toPandas()

Unnamed: 0,Name,Age,Club
0,L. Messi,31,FC Barcelona
1,Cristiano Ronaldo,33,Juventus
2,L. Suárez,31,FC Barcelona
3,P. Dybala,24,Juventus
4,M. ter Stegen,26,FC Barcelona


In [71]:
(
    fifa
    .select(
        'Name',
        'Age',
        'Club'
    )
    .filter(
        (col('Club')=='Juventus')
        | (col('Club')=='FC Barcelona')
    )
    .limit(5)
).toPandas()

Unnamed: 0,Name,Age,Club
0,L. Messi,31,FC Barcelona
1,Cristiano Ronaldo,33,Juventus
2,L. Suárez,31,FC Barcelona
3,P. Dybala,24,Juventus
4,M. ter Stegen,26,FC Barcelona


In [77]:
# .collect() returns the result as a Python list of Row objects

result = fifa.filter("Overall > 50")                           \
             .select(["Nationality", "Name", "Age", "Overall"])\
             .orderBy(fifa["Overall"].desc()).limit(5).collect()

result

[Row(Nationality='Argentina', Name='L. Messi', Age=31, Overall=94),
 Row(Nationality='Portugal', Name='Cristiano Ronaldo', Age=33, Overall=94),
 Row(Nationality='Brazil', Name='Neymar Jr', Age=26, Overall=92),
 Row(Nationality='Belgium', Name='K. De Bruyne', Age=27, Overall=91),
 Row(Nationality='Spain', Name='De Gea', Age=27, Overall=91)]

In [78]:
# result is a plain Python list of Row objects, so it can be indexed
print("Best player with Overall > 50:", result[0][1])

Best player with Overall > 50: L. Messi


In [79]:
# Indexing the DataFrame itself returns a Column expression, not a value
print("Best player with Overall > 50:", fifa[0][1])

Best player with Overall > 50: Column<'_c0[1]'>


In [80]:
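# result holds only the top 5 rows, so result[-1] is the fifth-best player, not the worst one
print("Fifth-best player with Overall > 50:", result[-1][1])

Fifth-best player with Overall > 50: De Gea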


### DataFrame manipulation

In [82]:
from pyspark.sql.functions import *

# concat_ws()

concat = fifa.select(fifa.Name,
                     fifa.Nationality,
                     concat_ws(" ", fifa.Name, fifa.Nationality).alias("Nombre/Nacionalidad"))

concat.show(truncate = False)

+-----------------+-----------+--------------------------+
|Name             |Nationality|Nombre/Nacionalidad       |
+-----------------+-----------+--------------------------+
|L. Messi         |Argentina  |L. Messi Argentina        |
|Cristiano Ronaldo|Portugal   |Cristiano Ronaldo Portugal|
|Neymar Jr        |Brazil     |Neymar Jr Brazil          |
|De Gea           |Spain      |De Gea Spain              |
|K. De Bruyne     |Belgium    |K. De Bruyne Belgium      |
|E. Hazard        |Belgium    |E. Hazard Belgium         |
|L. Modrić        |Croatia    |L. Modrić Croatia         |
|L. Suárez        |Uruguay    |L. Suárez Uruguay         |
|Sergio Ramos     |Spain      |Sergio Ramos Spain        |
|J. Oblak         |Slovenia   |J. Oblak Slovenia         |
|R. Lewandowski   |Poland     |R. Lewandowski Poland     |
|T. Kroos         |Germany    |T. Kroos Germany          |
|D. Godín         |Uruguay    |D. Godín Uruguay          |
|David Silva      |Spain      |David Silva Spain         |

In [87]:
# concat_ws()

concat = fifa.select(fifa.Name,
                     fifa.Nationality,
                     concat_ws("-", fifa.Name, fifa.Nationality, fifa.Club, fifa.Age).alias("Nombre/Nacionalidad"))

concat.show(truncate = False)

+-----------------+-----------+------------------------------------------+
|Name             |Nationality|Nombre/Nacionalidad                       |
+-----------------+-----------+------------------------------------------+
|L. Messi         |Argentina  |L. Messi-Argentina-FC Barcelona-31        |
|Cristiano Ronaldo|Portugal   |Cristiano Ronaldo-Portugal-Juventus-33    |
|Neymar Jr        |Brazil     |Neymar Jr-Brazil-Paris Saint-Germain-26   |
|De Gea           |Spain      |De Gea-Spain-Manchester United-27         |
|K. De Bruyne     |Belgium    |K. De Bruyne-Belgium-Manchester City-27   |
|E. Hazard        |Belgium    |E. Hazard-Belgium-Chelsea-27              |
|L. Modrić        |Croatia    |L. Modrić-Croatia-Real Madrid-32          |
|L. Suárez        |Uruguay    |L. Suárez-Uruguay-FC Barcelona-31         |
|Sergio Ramos     |Spain      |Sergio Ramos-Spain-Real Madrid-32         |
|J. Oblak         |Slovenia   |J. Oblak-Slovenia-Atlético Madrid-25      |
|R. Lewandowski   |Poland

In [90]:
(
    fifa
    .select(
        'Name',
        'Nationality',
        'Club',
        'Age'
    )
    .withColumn(
        'todo_unido', 
        concat_ws(
            '-', 
            col('Name'),
            col('Nationality'),
            col('Age'),
            col('Club')
        )
    )
).show(5, truncate=False)

+-----------------+-----------+-------------------+---+---------------------------------------+
|Name             |Nationality|Club               |Age|todo_unido                             |
+-----------------+-----------+-------------------+---+---------------------------------------+
|L. Messi         |Argentina  |FC Barcelona       |31 |L. Messi-Argentina-31-FC Barcelona     |
|Cristiano Ronaldo|Portugal   |Juventus           |33 |Cristiano Ronaldo-Portugal-33-Juventus |
|Neymar Jr        |Brazil     |Paris Saint-Germain|26 |Neymar Jr-Brazil-26-Paris Saint-Germain|
|De Gea           |Spain      |Manchester United  |27 |De Gea-Spain-27-Manchester United      |
|K. De Bruyne     |Belgium    |Manchester City    |27 |K. De Bruyne-Belgium-27-Manchester City|
+-----------------+-----------+-------------------+---+---------------------------------------+
only showing top 5 rows
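
One detail the outputs above do not show is how `concat_ws` treats nulls: Spark skips null inputs instead of turning the whole result into null. The pure-Python sketch below mimics that behaviour; `concat_ws_py` is a hypothetical helper written only for illustration and is not part of PySpark.

```python
def concat_ws_py(sep, *values):
    """Join values with sep, skipping None, similar to Spark's concat_ws."""
    return sep.join(str(v) for v in values if v is not None)


row = ("L. Messi", "Argentina", 31, "FC Barcelona")
print(concat_ws_py("-", *row))                # L. Messi-Argentina-31-FC Barcelona
print(concat_ws_py("-", "De Gea", None, 27))  # De Gea-27 (the None is skipped)
```
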



In [91]:
concat.rdd.id()

323

In [92]:
# New DataFrame

videos = spark.read.csv(path = "../data/youtubevideos.csv",
                        header = True, inferSchema = True)

videos.limit(3).toPandas()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"""last week tonight trump presidency""|""last wee...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"""racist superman""|""rudy""|""mancuso""|""king""|""bac...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...


In [93]:
videos.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: string (nullable = true)
 |-- likes: string (nullable = true)
 |-- dislikes: string (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)



In [94]:
# We can reassign columns using .withColumn together with .cast, to_date or to_timestamp
# Note: in Spark date patterns "mm" is minute-of-hour and "MM" is month-of-year, so the
# pattern "yy.dd.mm" below parses 17.14.11 as 2017-01-14; "yy.dd.MM" would give 2017-11-14

df = videos.withColumn("views"        , videos["views"].cast(IntegerType()))                        \
           .withColumn("likes"        , videos["likes"].cast(IntegerType()))                        \
           .withColumn("dislikes"     , videos["dislikes"].cast(IntegerType()))                     \
           .withColumn("category_id"  , videos["category_id"].cast(IntegerType()))                  \
           .withColumn("trending_date", to_date(videos.trending_date, "yy.dd.mm")) 

In [95]:
df.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: date (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: integer (nullable = true)
 |-- publish_time: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: integer (nullable = true)
 |-- likes: integer (nullable = true)
 |-- dislikes: integer (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)



In [96]:
df.limit(3).toPandas()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,2017-01-14,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,2017-01-14,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"""last week tonight trump presidency""|""last wee...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,2017-01-14,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"""racist superman""|""rudy""|""mancuso""|""king""|""bac...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
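
The `trending_date` values above come out as 2017-01-14 although the raw strings are `17.14.11`. In Spark's date patterns lowercase `mm` means minute-of-hour and uppercase `MM` means month, so the last field is read as minutes and the month falls back to January. The sketch below reproduces the same effect with Python's `datetime.strptime`, which uses different directives (`%m` for month, `%M` for minute); it is an analogy, not Spark's parser.

```python
from datetime import datetime

raw = "17.14.11"  # year.day.month, i.e. 14 November 2017

# Last field read as the month: the intended result.
as_month = datetime.strptime(raw, "%y.%d.%m")

# Last field read as minutes: the month keeps its default value, January.
as_minutes = datetime.strptime(raw, "%y.%d.%M")

print(as_month.date())    # 2017-11-14
print(as_minutes.date())  # 2017-01-14
```

In the notebook this suggests that the pattern `"yy.dd.MM"` is what was intended for `to_date`.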


In [97]:
# .withColumn() also lets us create columns from other columns

df = df.withColumn("publish_time_2", regexp_replace(df.publish_time, "T", " "))
df = df.withColumn("publish_time_2", regexp_replace(df.publish_time_2, "Z", ""))

df.select("publish_time", "publish_time_2").show(5, truncate = False)

+------------------------+-----------------------+
|publish_time            |publish_time_2         |
+------------------------+-----------------------+
|2017-11-13T17:13:01.000Z|2017-11-13 17:13:01.000|
|2017-11-13T07:30:00.000Z|2017-11-13 07:30:00.000|
|2017-11-12T19:05:24.000Z|2017-11-12 19:05:24.000|
|2017-11-13T11:00:04.000Z|2017-11-13 11:00:04.000|
|2017-11-12T18:01:41.000Z|2017-11-12 18:01:41.000|
+------------------------+-----------------------+
only showing top 5 rows



In [98]:
# lower()
df.select("title", lower(df.title)).show(5, False)

+--------------------------------------------------------------+--------------------------------------------------------------+
|title                                                         |lower(title)                                                  |
+--------------------------------------------------------------+--------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                            |we want to talk about our marriage                            |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)|the trump presidency: last week tonight with john oliver (hbo)|
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons         |racist superman | rudy mancuso, king bach & lele pons         |
|Nickelback Lyrics: Real or Fake?                              |nickelback lyrics: real or fake?                              |
|I Dare You: GOING BALD!?                                      |i dare you: going bald!?                

In [99]:
# when() can create a column from other columns depending on whether a condition holds

df.select("likes",
          "dislikes",
          (when(df.likes > df.dislikes, "Good").when(df.likes < df.dislikes, "Bad").when(df.likes == df.dislikes, "Equal")\
          .otherwise("Undetermined")).alias("Favorability")).show(5)

# otherwise() covers the rows where no condition evaluates to true, which happens, for example, when there are nulls

+------+--------+------------+
| likes|dislikes|Favorability|
+------+--------+------------+
| 57527|    2966|        Good|
| 97185|    6146|        Good|
|146033|    5339|        Good|
| 10172|     666|        Good|
|132235|    1989|        Good|
+------+--------+------------+
only showing top 5 rows
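
The reason `otherwise()` matters with nulls is SQL's three-valued logic: a comparison with a null operand is neither true nor false, so none of the `when` branches is taken. The following is a minimal pure-Python sketch of that rule, using `None` for null; the helper names are hypothetical.

```python
def sql_gt(a, b):
    """SQL-style a > b: the result is None (unknown) if either side is None."""
    return None if a is None or b is None else a > b


def sql_eq(a, b):
    """SQL-style a = b with the same null rule."""
    return None if a is None or b is None else a == b


def favorability(likes, dislikes):
    # A branch is taken only when its condition is exactly True.
    if sql_gt(likes, dislikes) is True:
        return "Good"
    if sql_gt(dislikes, likes) is True:
        return "Bad"
    if sql_eq(likes, dislikes) is True:
        return "Equal"
    return "Undetermined"  # plays the role of otherwise()


print(favorability(57527, 2966))  # Good
print(favorability(None, 2966))   # Undetermined
```
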



In [101]:
# expr

# with expr we can describe the new column in SQL syntax

(
    df
    .select("likes",
          "dislikes",
          expr("CASE WHEN likes > dislikes THEN 'Good' \
                     WHEN dislikes > likes THEN 'Bad'  \
                     WHEN likes = dislikes THEN 'Equal'\
                     ELSE 'Undetermined' END           \
                AS Favorability")
         )
    .groupBy('Favorability')
    .count()
).show(5)

+------------+-----+
|Favorability|count|
+------------+-----+
|       Equal|  181|
|        Good|40192|
|         Bad|  576|
|Undetermined| 7188|
+------------+-----+



In [106]:
# year(), month() & dayofmonth()
# This works because the column is of type DateType()

df.select("trending_date",
          year("trending_date").alias("year"),
          month("trending_date").alias("month"),
          dayofmonth("trending_date").alias("day"),
          dayofweek("trending_date").alias("day_of_week"),
          dayofyear("trending_date").alias("day_of_year")
         ).show(5)

+-------------+----+-----+---+-----------+-----------+
|trending_date|year|month|day|day_of_week|day_of_year|
+-------------+----+-----+---+-----------+-----------+
|   2017-01-14|2017|    1| 14|          7|         14|
|   2017-01-14|2017|    1| 14|          7|         14|
|   2017-01-14|2017|    1| 14|          7|         14|
|   2017-01-14|2017|    1| 14|          7|         14|
|   2017-01-14|2017|    1| 14|          7|         14|
+-------------+----+-----+---+-----------+-----------+
only showing top 5 rows
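
The `day_of_week` value of 7 may look odd: Spark's `dayofweek` numbers the week from Sunday = 1 to Saturday = 7, whereas Python's `isoweekday()` runs from Monday = 1 to Sunday = 7. The sketch below recomputes the columns above with the standard library; `spark_dayofweek` is a hypothetical helper that only converts between the two conventions.

```python
from datetime import date


def spark_dayofweek(d):
    """Map isoweekday (Mon=1..Sun=7) to Spark's numbering (Sun=1..Sat=7)."""
    return d.isoweekday() % 7 + 1


d = date(2017, 1, 14)
print(d.year, d.month, d.day)  # 2017 1 14
print(spark_dayofweek(d))      # 7, a Saturday
print(d.timetuple().tm_yday)   # 14, the day of the year
```
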



In [107]:
# datediff()
# trending_date is a DateType() column; publish_time_2 is a string that Spark casts to a date implicitly

df.select("trending_date",
          "publish_time_2",
          datediff(df.publish_time_2, df.trending_date)).show(10, False)

+-------------+-----------------------+---------------------------------------+
|trending_date|publish_time_2         |datediff(publish_time_2, trending_date)|
+-------------+-----------------------+---------------------------------------+
|2017-01-14   |2017-11-13 17:13:01.000|303                                    |
|2017-01-14   |2017-11-13 07:30:00.000|303                                    |
|2017-01-14   |2017-11-12 19:05:24.000|302                                    |
|2017-01-14   |2017-11-13 11:00:04.000|303                                    |
|2017-01-14   |2017-11-12 18:01:41.000|302                                    |
|2017-01-14   |2017-11-13 19:07:23.000|303                                    |
|2017-01-14   |2017-11-12 05:37:17.000|302                                    |
|2017-01-14   |2017-11-12 21:50:37.000|302                                    |
|2017-01-14   |2017-11-13 14:00:23.000|303                                    |
|2017-01-14   |2017-11-13 13:45:16.000|3

In [108]:
# split()
array = df.select("title",
                  split(df.title, " ").alias("split"))

array.show(5, False)

+--------------------------------------------------------------+-------------------------------------------------------------------------+
|title                                                         |split                                                                    |
+--------------------------------------------------------------+-------------------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                            |[WE, WANT, TO, TALK, ABOUT, OUR, MARRIAGE]                               |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)|[The, Trump, Presidency:, Last, Week, Tonight, with, John, Oliver, (HBO)]|
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons         |[Racist, Superman, |, Rudy, Mancuso,, King, Bach, &, Lele, Pons]         |
|Nickelback Lyrics: Real or Fake?                              |[Nickelback, Lyrics:, Real, or, Fake?]                                   |
|I Dare You: GOING BALD!?  

In [109]:
# array_contains is similar to "in" in Python

array.select("split",
             array_contains(array.split, "(HBO)")).show(5, False)

+-------------------------------------------------------------------------+----------------------------+
|split                                                                    |array_contains(split, (HBO))|
+-------------------------------------------------------------------------+----------------------------+
|[WE, WANT, TO, TALK, ABOUT, OUR, MARRIAGE]                               |false                       |
|[The, Trump, Presidency:, Last, Week, Tonight, with, John, Oliver, (HBO)]|true                        |
|[Racist, Superman, |, Rudy, Mancuso,, King, Bach, &, Lele, Pons]         |false                       |
|[Nickelback, Lyrics:, Real, or, Fake?]                                   |false                       |
|[I, Dare, You:, GOING, BALD!?]                                           |false                       |
+-------------------------------------------------------------------------+----------------------------+
only showing top 5 rows
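
To make the comparison with Python's `in` concrete, here is a small standard-library sketch of `split` followed by `array_contains`. Two caveats: Spark's `split` takes a regular expression while `str.split` takes a literal separator (for a single space they behave alike), and `contains_token` is a hypothetical helper name.

```python
def contains_token(title, token, sep=" "):
    """Split title on sep and test whether token is one of the pieces."""
    return token in title.split(sep)


title = "The Trump Presidency: Last Week Tonight with John Oliver (HBO)"
print(title.split(" ")[-1])            # (HBO)
print(contains_token(title, "(HBO)"))  # True
print(contains_token(title, "HBO"))    # False: the element is "(HBO)", not "HBO"
```

The last call shows that the match is on whole elements, not on substrings.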



In [114]:
# Flag the rows whose split array contains "(HBO)" in a column called checks,
# keep only the rows where that flag is true (while the checks column still exists),
# and finally select the distinct values of the split column with distinct()

(
    array
    .select(
        "split",
        array_contains(
            array.split, 
            "(HBO)"
        ).alias('checks')
    )
    .filter(col('checks'))
    .select('split').distinct()
    .show(5, False)
)

+-------------------------------------------------------------------------------------+
|split                                                                                |
+-------------------------------------------------------------------------------------+
|[The, Trump, Presidency:, Last, Week, Tonight, with, John, Oliver, (HBO)]            |
|[What, It's, Like, To, Be, Absolutely, Obsessed, With, Bitcoin, (HBO)]               |
|[Watch, Silicon, Valley, Nerds, Face, Off, A, Capella, (HBO)]                        |
|[Last, Week, Tonight:, Season, 5, Official, Trailer, (HBO)]                          |
|[This, Hidden, 300, Foot, Stretch, Of, The, Berlin, Wall, Is, Still, Standing, (HBO)]|
+-------------------------------------------------------------------------------------+
only showing top 5 rows



In [115]:
# array_distinct is similar to .unique() in pandas

array.select("title", array_distinct(array.split)).show(10, False)

+-----------------------------------------------------------------+---------------------------------------------------------------------------+
|title                                                            |array_distinct(split)                                                      |
+-----------------------------------------------------------------+---------------------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                               |[WE, WANT, TO, TALK, ABOUT, OUR, MARRIAGE]                                 |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)   |[The, Trump, Presidency:, Last, Week, Tonight, with, John, Oliver, (HBO)]  |
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons            |[Racist, Superman, |, Rudy, Mancuso,, King, Bach, &, Lele, Pons]           |
|Nickelback Lyrics: Real or Fake?                                 |[Nickelback, Lyrics:, Real, or, Fake?]                               

In [116]:
# array_remove removes every occurrence of an element from an array

array.select("title", array_remove(array.split, "Presidency:")).show(5, False)

+--------------------------------------------------------------+----------------------------------------------------------------+
|title                                                         |array_remove(split, Presidency:)                                |
+--------------------------------------------------------------+----------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                            |[WE, WANT, TO, TALK, ABOUT, OUR, MARRIAGE]                      |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)|[The, Trump, Last, Week, Tonight, with, John, Oliver, (HBO)]    |
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons         |[Racist, Superman, |, Rudy, Mancuso,, King, Bach, &, Lele, Pons]|
|Nickelback Lyrics: Real or Fake?                              |[Nickelback, Lyrics:, Real, or, Fake?]                          |
|I Dare You: GOING BALD!?                                      |[I, Dare, You:, GOING, BAL
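
The titles shown above contain no repeated words, so the `array_distinct` output looks identical to the input. The pure-Python sketch below uses a made-up title with repetitions to show what both functions do; the helper names are hypothetical, and keeping the first occurrence of each element is an assumption about ordering rather than something the outputs above confirm.

```python
def array_distinct_py(items):
    """Drop duplicates, keeping the first occurrence of each element."""
    return list(dict.fromkeys(items))


def array_remove_py(items, element):
    """Remove every occurrence equal to element."""
    return [x for x in items if x != element]


tokens = "I Dare You: I Dare You: GOING BALD!?".split(" ")  # made-up title
print(array_distinct_py(tokens))     # ['I', 'Dare', 'You:', 'GOING', 'BALD!?']
print(array_remove_py(tokens, "I"))  # ['Dare', 'You:', 'Dare', 'You:', 'GOING', 'BALD!?']
```
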

### UDF

In [5]:
# We can use our own functions to create new columns

from pyspark.sql.functions import udf          # user-defined functions
from pyspark.sql.types import IntegerType

In [121]:
# returnType declares the Spark type of the value returned by the lambda

def square(x):
    return int(x**2)

square_udf = udf(f          = lambda x : square(x),
                 returnType = IntegerType()
                )

df.select("dislikes",
          square_udf("dislikes").alias("dislikes**2")).where(col("dislikes").isNotNull()).toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40949 entries, 0 to 40948
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   dislikes     40949 non-null  int32
 1   dislikes**2  40949 non-null  int32
dtypes: int32(2)
memory usage: 320.0 KB


In [122]:
(
    df
    .filter(col('dislikes').isNotNull()) # filter out the null values
    .withColumn('dislikes**2', square_udf('dislikes')) # create a new column with the udf
    .select('dislikes', 'dislikes**2')
).show(5)

+--------+-----------+
|dislikes|dislikes**2|
+--------+-----------+
|    2966|    8797156|
|    6146|   37773316|
|    5339|   28504921|
|     666|     443556|
|    1989|    3956121|
+--------+-----------+
only showing top 5 rows



In [None]:
# If we run this without .isNotNull() it raises an error, because square() cannot handle nulls (None)
# df.select("dislikes", square_udf("dislikes")).show(5)
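
An alternative to filtering out nulls first is to make the Python function itself tolerate `None`. The sketch below shows the idea in plain Python; `null_safe` is a hypothetical decorator-style helper, and passing the wrapped function to `udf` in place of `square` is an untested assumption about how it would be used in the notebook.

```python
def square(x):
    return int(x ** 2)


def null_safe(func):
    """Wrap func so that None is passed through instead of raising."""
    def wrapper(x):
        return None if x is None else func(x)
    return wrapper


safe_square = null_safe(square)

print(safe_square(2966))  # 8797156
print(safe_square(None))  # None

try:
    square(None)
except TypeError as error:
    # This is the failure the unfiltered call runs into.
    print("square(None) failed:", type(error).__name__)
```
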

### Aggregate Functions

In [123]:
# Same idea as .groupby() and .agg() in pandas

fifa.groupBy("Club", "Nationality").agg({"ID" : "count"}).show(1_000, truncate = False)

+-----------------------------------+--------------------+---------+
|Club                               |Nationality         |count(ID)|
+-----------------------------------+--------------------+---------+
|Juventus                           |Argentina           |1        |
|Manchester United                  |England             |11       |
|Sevilla FC                         |Denmark             |1        |
|Watford                            |Argentina           |1        |
|Burnley                            |Wales               |1        |
|Atiker Konyaspor                   |Turkey              |11       |
|Beşiktaş JK                        |Canada              |1        |
|Vitesse                            |South Africa        |1        |
|Santos Laguna                      |Uruguay             |3        |
|New York Red Bulls                 |United States       |14       |
|Rayo Vallecano                     |Portugal            |1        |
|Molde FK                         

In [124]:
(
    fifa
    .groupBy(
        "Club", 
        "Nationality"
    )
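    .agg(
        # Careful: a Python dict keeps only the last value of a repeated key,
        # so of the three "Age" entries only "min" reaches Spark
        {
            "ID" : "count",
            "Age": "mean",
            "Age": "max",
            "Age":'min',
            "Overall": 'mean'
        }
    )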
).show(10, truncate = False)

+------------------+-------------+-----------------+---------+--------+
|Club              |Nationality  |avg(Overall)     |count(ID)|min(Age)|
+------------------+-------------+-----------------+---------+--------+
|Juventus          |Argentina    |89.0             |1        |24      |
|Manchester United |England      |74.0909090909091 |11       |17      |
|Sevilla FC        |Denmark      |81.0             |1        |29      |
|Watford           |Argentina    |80.0             |1        |27      |
|Burnley           |Wales        |76.0             |1        |28      |
|Atiker Konyaspor  |Turkey       |69.36363636363636|11       |21      |
|Beşiktaş JK       |Canada       |75.0             |1        |23      |
|Vitesse           |South Africa |75.0             |1        |28      |
|Santos Laguna     |Uruguay      |72.66666666666667|3        |24      |
|New York Red Bulls|United States|65.57142857142857|14       |18      |
+------------------+-------------+-----------------+---------+--
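
The output above has only one `Age` column, `min(Age)`, even though the dictionary names three aggregations for `Age`. This comes from Python rather than Spark: a dict literal with a repeated key keeps only the last value. A short standalone check:

```python
# Same dictionary as in the cell above.
aggs = {
    "ID": "count",
    "Age": "mean",
    "Age": "max",
    "Age": "min",
    "Overall": "mean",
}

print(aggs)       # {'ID': 'count', 'Age': 'min', 'Overall': 'mean'}
print(len(aggs))  # 3
```

To compute several aggregations over the same column, the function-based notation (`mean('Age')`, `min('Age')`, `max('Age')`) avoids the problem because each aggregation is a separate argument.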

In [129]:
(
    fifa
    .groupBy(
        "Club", 
        "Nationality"
    )
    .agg(
        mean('Age').alias('Mean_Age'),
        min('Age').alias('Min_Age'),
        max('Age').alias('Max_Age'),
        mean('Overall').alias('Mean_Overall'),
        count('ID').alias('Count')
    )
).show(10, truncate = False)

+------------------+-------------+------------------+-------+-------+-----------------+-----+
|Club              |Nationality  |Mean_Age          |Min_Age|Max_Age|Mean_Overall     |Count|
+------------------+-------------+------------------+-------+-------+-----------------+-----+
|Juventus          |Argentina    |24.0              |24     |24     |89.0             |1    |
|Manchester United |England      |23.818181818181817|17     |35     |74.0909090909091 |11   |
|Sevilla FC        |Denmark      |29.0              |29     |29     |81.0             |1    |
|Watford           |Argentina    |27.0              |27     |27     |80.0             |1    |
|Burnley           |Wales        |28.0              |28     |28     |76.0             |1    |
|Atiker Konyaspor  |Turkey       |27.727272727272727|21     |34     |69.36363636363636|11   |
|Beşiktaş JK       |Canada       |23.0              |23     |23     |75.0             |1    |
|Vitesse           |South Africa |28.0              |28     

In [6]:
import pandas as pd

df_fifa = pd.read_csv(filepath_or_buffer = "../data/fifa19.csv")

df_fifa.groupby(["Club", "Nationality"]).agg({"ID" : "count"})

Unnamed: 0_level_0,Unnamed: 1_level_0,ID
Club,Nationality,Unnamed: 2_level_1
SSV Jahn Regensburg,Armenia,1
SSV Jahn Regensburg,Denmark,1
SSV Jahn Regensburg,Germany,22
SSV Jahn Regensburg,Kosovo,1
SSV Jahn Regensburg,Lithuania,1
...,...,...
Śląsk Wrocław,Latvia,1
Śląsk Wrocław,Poland,20
Śląsk Wrocław,Portugal,1
Śląsk Wrocław,Serbia,1


In [132]:
# With this notation we can add .alias to the columns

fifa.groupBy("Club").agg(min(fifa.Age).alias("Min Age"),
                         max(fifa.Age).alias("Max Age")).show()

+--------------------+-------+-------+
|                Club|Min Age|Max Age|
+--------------------+-------+-------+
|             Palermo|     18|     37|
|          Göztepe SK|     17|     36|
|CD Everton de Viñ...|     18|     31|
|     Shonan Bellmare|     19|     38|
|          Sagan Tosu|     19|     34|
|  1. FC Union Berlin|     18|     32|
|               Carpi|     18|     31|
|           Puebla FC|     19|     35|
|  Argentinos Juniors|     17|     35|
|     SC Paderborn 07|     18|     36|
|       Karlsruher SC|     18|     35|
|         SC Freiburg|     19|     31|
|San Lorenzo de Al...|     19|     38|
|  SpVgg Unterhaching|     18|     39|
|Universidad Católica|     17|     33|
|         GFC Ajaccio|     18|     35|
|           FC Luzern|     18|     34|
|                 AIK|     17|     38|
|       SC Heerenveen|     17|     34|
|              Santos|     26|     34|
+--------------------+-------+-------+
only showing top 20 rows



In [134]:
# With .summary() we can get a similar result
# Note: the columns of videos are still strings here, so min and max are compared as text

videos.select("views", "likes", "dislikes")                                      \
      .summary("count", "min", "25%", "50%", "75%", "max", "stddev", "mean").limit(6).toPandas()

Unnamed: 0,summary,views,likes,dislikes
0,count,41061,41043,41035
1,min,Geno’s,Kendall Jenner and Kate Upton for a look at t...,D'Onofrio makes fusilli al ferretto
2,25%,242240.0,5417.0,202.0
3,50%,681439.0,18084.0,631.0
4,75%,1822798.0,55405.0,1937.0
5,max,99999,99990,9993
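
The odd `min` and `max` rows above are consistent with the schema printed earlier: in `videos` every column is still a string, so the extremes are presumably chosen by text comparison rather than by numeric value. The sketch below shows the difference in plain Python, using view counts from the sample rows plus `"99999"` from the summary.

```python
views = ["748374", "2418783", "3191434", "99999"]

# Text comparison goes character by character, so "9..." beats "3...".
print(max(views))                  # 99999
print(min(views))                  # 2418783

# Numeric comparison after casting gives the expected answer.
print(max(int(v) for v in views))  # 3191434
print(min(int(v) for v in views))  # 99999
```

Running `.summary()` on `df`, where these columns were cast to integers, should therefore give numeric extremes.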


### Joins

In [7]:
titanic1 = spark.read.csv(path = "../data/titanic 1.csv",
                          inferSchema = True, header = True)

titanic2 = spark.read.csv(path = "../data/titanic 2.csv",
                          inferSchema = True, header = True)

In [8]:
titanic1.limit(3).toPandas()

Unnamed: 0,PassengerId,Name,Sex,Age
0,1,"Braund, Mr. Owen Harris",male,22.0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,3,"Heikkinen, Miss. Laina",female,26.0


In [9]:
titanic2.limit(3).toPandas()

Unnamed: 0,PassengerId,Survived,Pclass,Ticket,Fare
0,1,0,3,A/5 21171,7.25
1,2,1,1,PC 17599,71.2833
2,3,1,3,STON/O2. 3101282,7.925


In [11]:
titanic1.count(), titanic2.count()

(891, 891)

In [10]:
# .union works like pd.concat with axis = 0: it appends the rows
# Both DataFrames must have the same number of columns; columns are matched by position
# Duplicate rows are kept

titanic = titanic1.union(titanic1)

print(titanic1.count())
print(titanic.count())

891
1782


In [12]:
# Inner Joins
titanic = titanic1.join(other = titanic2, on = ["PassengerId"], how = "inner")

titanic.show()

+-----------+--------------------+------+----+--------+------+----------------+-------+
|PassengerId|                Name|   Sex| Age|Survived|Pclass|          Ticket|   Fare|
+-----------+--------------------+------+----+--------+------+----------------+-------+
|          1|Braund, Mr. Owen ...|  male|22.0|       0|     3|       A/5 21171|   7.25|
|          2|Cumings, Mrs. Joh...|female|38.0|       1|     1|        PC 17599|71.2833|
|          3|Heikkinen, Miss. ...|female|26.0|       1|     3|STON/O2. 3101282|  7.925|
|          4|Futrelle, Mrs. Ja...|female|35.0|       1|     1|          113803|   53.1|
|          5|Allen, Mr. Willia...|  male|35.0|       0|     3|          373450|   8.05|
|          6|    Moran, Mr. James|  male|null|       0|     3|          330877| 8.4583|
|          7|McCarthy, Mr. Tim...|  male|54.0|       0|     1|           17463|51.8625|
|          8|Palsson, Master. ...|  male| 2.0|       0|     3|          349909| 21.075|
|          9|Johnson, Mrs. Osc..

In [13]:
titanic.count()

891
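
For readers new to joins, this is what an inner join on `PassengerId` computes, sketched in plain Python over lists of dicts. It is a simple hash join written for illustration: Spark chooses its own join strategy, `inner_join` is a hypothetical helper, and the passenger with id 1000 is made up to show a row without a match.

```python
def inner_join(left, right, key):
    """Return merged rows for every pair of left/right rows sharing the key."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


names = [
    {"PassengerId": 1, "Name": "Braund, Mr. Owen Harris"},
    {"PassengerId": 3, "Name": "Heikkinen, Miss. Laina"},
    {"PassengerId": 1000, "Name": "Made-up passenger"},  # no match on the right
]
fares = [
    {"PassengerId": 1, "Fare": 7.25},
    {"PassengerId": 3, "Fare": 7.925},
]

joined = inner_join(names, fares, "PassengerId")
print(len(joined))  # 2
print(joined[0])    # {'PassengerId': 1, 'Name': 'Braund, Mr. Owen Harris', 'Fare': 7.25}
```

In the notebook both files hold the same 891 passengers, which is why the joined count stays at 891.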

### Missing Values

In [14]:
# Filter with isNull()

titanic.select(["Name", "PassengerId", "Age"]).filter(titanic.Age.isNull()).show(5)

+--------------------+-----------+----+
|                Name|PassengerId| Age|
+--------------------+-----------+----+
|    Moran, Mr. James|          6|null|
|Williams, Mr. Cha...|         18|null|
|Masselmani, Mrs. ...|         20|null|
|Emir, Mr. Farred ...|         27|null|
|"O'Dwyer, Miss. E...|         29|null|
+--------------------+-----------+----+
only showing top 5 rows



In [15]:
titanic.select(["Name", "PassengerId", "Age"]).filter(titanic.Age.isNull()).count()

177

In [16]:
# This function counts how many rows have a null value in each column

from pyspark.sql.functions import *

def null_value_calc(df):
    null_columns_counts = list()
    numRows = df.count()
    
    for k in df.columns:
        nullRows = df.where(col(k).isNull()).count()
        
        if (nullRows > 0):
            temp = k, nullRows, (nullRows / numRows)*100
            null_columns_counts.append(temp)
            
    return null_columns_counts

null_columns_calc_list = null_value_calc(titanic)

null_columns_calc_list

[('Age', 177, 19.865319865319865)]

In [17]:
spark.createDataFrame(data = null_columns_calc_list,
                      schema = ["Name", "Count", "Percent"]).show()

+----+-----+------------------+
|Name|Count|           Percent|
+----+-----+------------------+
| Age|  177|19.865319865319865|
+----+-----+------------------+



In [18]:
spark.createDataFrame(data = null_columns_calc_list,
                      schema = ["Name", "Count", "Percent"]).toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Name     1 non-null      object 
 1   Count    1 non-null      int64  
 2   Percent  1 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 152.0+ bytes


In [19]:
# df.na.drop() = df.dropna()

titanic.na.drop().limit(6).toPandas()

Unnamed: 0,PassengerId,Name,Sex,Age,Survived,Pclass,Ticket,Fare
0,1,"Braund, Mr. Owen Harris",male,22.0,0,3,A/5 21171,7.25
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,1,PC 17599,71.2833
2,3,"Heikkinen, Miss. Laina",female,26.0,1,3,STON/O2. 3101282,7.925
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,1,113803,53.1
4,5,"Allen, Mr. William Henry",male,35.0,0,3,373450,8.05
5,7,"McCarthy, Mr. Timothy J",male,54.0,0,1,17463,51.8625


In [20]:
titanic.na.drop().count()

714

In [21]:
# .na.drop() without parameters

og_len = titanic.count()
drop_len = titanic.na.drop().count()

print("Rows dropped", og_len - drop_len)
print("Percentage of rows dropped", (og_len - drop_len)/og_len*100)

Rows dropped 177
Percentage of rows dropped 19.865319865319865


In [26]:
titanic.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)



In [22]:
# .na.drop() with thresh = 8: minimum number of non-null values a row needs in order to be kept

og_len = titanic.count()
drop_len = titanic.na.drop(thresh = 8).count()

print("Rows dropped", og_len - drop_len)
print("Percentage of rows dropped", (og_len - drop_len)/og_len*100)

Rows dropped 177
Percentage of rows dropped 19.865319865319865


In [28]:
# .na.drop() with thresh = 6

og_len = titanic.count()
drop_len = titanic.na.drop(thresh = 6).count()

print("Rows dropped", og_len - drop_len)
print("Percentage of rows dropped", (og_len - drop_len)/og_len*100)

Rows dropped 0
Percentage of rows dropped 0.0


In [24]:
# With .na.drop() we can choose which columns are checked for nulls when dropping rows

og_len = titanic.count()
drop_len = titanic.na.drop(subset = ["Age"]).count()

print("Rows dropped", og_len - drop_len)
print("Percentage of rows dropped", (og_len - drop_len)/og_len*100)

Rows dropped 177
Percentage of rows dropped 19.865319865319865


In [25]:
# .na.drop() with how = "all" (a row is dropped only if all of its values are null)

og_len = titanic.count()
drop_len = titanic.na.drop(how = "all").count()

print("Rows dropped", og_len - drop_len)
print("Percentage of rows dropped", (og_len - drop_len)/og_len*100)

Rows dropped 0
Percentage of rows dropped 0.0
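
The `how`, `thresh` and `subset` parameters used in the cells above can be summarised as a single rule for deciding whether a row survives. The sketch below expresses that rule in plain Python for one row stored as a dict; `keep_row` is a hypothetical helper, and it assumes, as the PySpark documentation states, that `thresh` overrides `how` when both are given.

```python
def keep_row(row, how="any", thresh=None, subset=None):
    """Return True if na.drop-like rules would keep this row (a dict)."""
    columns = subset if subset is not None else list(row)
    non_null = sum(row[c] is not None for c in columns)
    if thresh is not None:
        # thresh takes precedence: keep rows with at least thresh non-null values.
        return non_null >= thresh
    if how == "any":
        # Drop the row if any checked value is null.
        return non_null == len(columns)
    # how == "all": drop the row only if every checked value is null.
    return non_null > 0


# Passenger 6 from the joined table: 8 columns, only Age is null.
moran = {"PassengerId": 6, "Name": "Moran, Mr. James", "Sex": "male", "Age": None,
         "Survived": 0, "Pclass": 3, "Ticket": "330877", "Fare": 8.4583}

print(keep_row(moran))                  # False: one null is enough with how="any"
print(keep_row(moran, thresh=8))        # False: only 7 non-null values
print(keep_row(moran, thresh=6))        # True
print(keep_row(moran, subset=["Age"]))  # False
print(keep_row(moran, how="all"))       # True
```
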


### Fill NaN's

In [29]:
# na.fill(value): value must match the dtype of the column
# Columns whose type does not match are left unchanged

titanic.na.fill(value = 9999).limit(6).toPandas()

Unnamed: 0,PassengerId,Name,Sex,Age,Survived,Pclass,Ticket,Fare
0,1,"Braund, Mr. Owen Harris",male,22.0,0,3,A/5 21171,7.25
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,1,PC 17599,71.2833
2,3,"Heikkinen, Miss. Laina",female,26.0,1,3,STON/O2. 3101282,7.925
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,1,113803,53.1
4,5,"Allen, Mr. William Henry",male,35.0,0,3,373450,8.05
5,6,"Moran, Mr. James",male,9999.0,0,3,330877,8.4583


In [30]:
# Row 6 (index 5): Age stays null because the fill value is a string and Age is numeric
titanic.na.fill(value = "NO AGE").limit(6).toPandas()

Unnamed: 0,PassengerId,Name,Sex,Age,Survived,Pclass,Ticket,Fare
0,1,"Braund, Mr. Owen Harris",male,22.0,0,3,A/5 21171,7.25
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,1,PC 17599,71.2833
2,3,"Heikkinen, Miss. Laina",female,26.0,1,3,STON/O2. 3101282,7.925
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,1,113803,53.1
4,5,"Allen, Mr. William Henry",male,35.0,0,3,373450,8.05
5,6,"Moran, Mr. James",male,,0,3,330877,8.4583


In [34]:
titanic.na.fill(value = 9999).groupBy('Age').count().orderBy(col('Age').desc()).show()

+------+-----+
|   Age|count|
+------+-----+
|9999.0|  177|
|  80.0|    1|
|  74.0|    1|
|  71.0|    2|
|  70.5|    1|
|  70.0|    2|
|  66.0|    1|
|  65.0|    3|
|  64.0|    2|
|  63.0|    2|
|  62.0|    4|
|  61.0|    3|
|  60.0|    4|
|  59.0|    2|
|  58.0|    5|
|  57.0|    2|
|  56.0|    4|
|  55.5|    1|
|  55.0|    2|
|  54.0|    8|
+------+-----+
only showing top 20 rows



If we try to fill with a value whose type does not match the column's data type, `na.fill()` leaves that column unchanged.

In [33]:
titanic.na.fill(value = "NO AGE").groupBy('Age').count().show()

+----+-----+
| Age|count|
+----+-----+
| 8.0|    4|
|70.0|    2|
| 7.0|    3|
|20.5|    1|
|49.0|    6|
|29.0|   20|
|40.5|    2|
|64.0|    2|
|47.0|    9|
|42.0|   13|
|24.5|    1|
|44.0|    9|
|35.0|   18|
|null|  177|
|62.0|    4|
|18.0|   26|
|80.0|    1|
|34.5|    1|
|39.0|   14|
| 1.0|    7|
+----+-----+
only showing top 20 rows



In [35]:
# We can restrict the fill to a specific column

titanic.na.fill(value = 9999, subset = ["Age"]).limit(6).toPandas()

Unnamed: 0,PassengerId,Name,Sex,Age,Survived,Pclass,Ticket,Fare
0,1,"Braund, Mr. Owen Harris",male,22.0,0,3,A/5 21171,7.25
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,1,PC 17599,71.2833
2,3,"Heikkinen, Miss. Laina",female,26.0,1,3,STON/O2. 3101282,7.925
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,1,113803,53.1
4,5,"Allen, Mr. William Henry",male,35.0,0,3,373450,8.05
5,6,"Moran, Mr. James",male,9999.0,0,3,330877,8.4583


In [36]:
# In one line: keep only the rows with a null Age, then fill them

titanic.filter(titanic.Age.isNull()).na.fill(value = 9999, subset = ["Age"]).limit(5).toPandas()

Unnamed: 0,PassengerId,Name,Sex,Age,Survived,Pclass,Ticket,Fare
0,6,"Moran, Mr. James",male,9999.0,0,3,330877,8.4583
1,18,"Williams, Mr. Charles Eugene",male,9999.0,1,2,244373,13.0
2,20,"Masselmani, Mrs. Fatima",female,9999.0,1,3,2649,7.225
3,27,"Emir, Mr. Farred Chehab",male,9999.0,0,3,2631,7.225
4,29,"""O'Dwyer, Miss. Ellen """"Nellie""""""",female,9999.0,1,3,330959,7.8792


In [42]:
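# Replace the nulls with the column mean

def fill_with_mean(df, include = set()):
    # A single aggregation computes the mean of every column listed in `include`
    stats = df.agg(*(avg(c).alias(c) for c in df.columns if c in include))

    # stats.first().asDict() maps column name -> mean, and na.fill() accepts that dict
    return df.na.fill(value = stats.first().asDict())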

In [43]:
updated_df = fill_with_mean(titanic, ["Age"])

In [39]:
# row 6
updated_df.limit(6).toPandas()

Unnamed: 0,PassengerId,Name,Sex,Age,Survived,Pclass,Ticket,Fare
0,1,"Braund, Mr. Owen Harris",male,22.0,0,3,A/5 21171,7.25
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,1,PC 17599,71.2833
2,3,"Heikkinen, Miss. Laina",female,26.0,1,3,STON/O2. 3101282,7.925
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,1,113803,53.1
4,5,"Allen, Mr. William Henry",male,35.0,0,3,373450,8.05
5,6,"Moran, Mr. James",male,29.699118,0,3,330877,8.4583


In [46]:
updated_df = (
    updated_df
    .withColumn('Age_rounded', round(col('Age'),0).cast(IntegerType()))
)
updated_df.toPandas().head(10)

Unnamed: 0,PassengerId,Name,Sex,Age,Survived,Pclass,Ticket,Fare,Age_rounded
0,1,"Braund, Mr. Owen Harris",male,22.0,0,3,A/5 21171,7.25,22
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,1,PC 17599,71.2833,38
2,3,"Heikkinen, Miss. Laina",female,26.0,1,3,STON/O2. 3101282,7.925,26
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,1,113803,53.1,35
4,5,"Allen, Mr. William Henry",male,35.0,0,3,373450,8.05,35
5,6,"Moran, Mr. James",male,29.699118,0,3,330877,8.4583,30
6,7,"McCarthy, Mr. Timothy J",male,54.0,0,1,17463,51.8625,54
7,8,"Palsson, Master. Gosta Leonard",male,2.0,0,3,349909,21.075,2
8,9,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,1,3,347742,11.1333,27
9,10,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,2,237736,30.0708,14


In [None]:
################################################################################################################################