<a href="https://colab.research.google.com/github/gabgovar/Apache-Spark/blob/main/Spark_DataFrames.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spark DataFrame

## Mounting Google Drive in Google Colab

In [13]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Installing the Spark and Hadoop Dependencies in Google Colab

In [3]:
# install the dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

## Configuring the Spark and Hadoop Dependencies in Google Colab

In [4]:
# set the environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

# make pyspark importable
import findspark
findspark.init('spark-2.4.4-bin-hadoop2.7')
findspark.find()

'spark-2.4.4-bin-hadoop2.7/python/pyspark'

## Leitura de um DataFrame pelo Spark

In [36]:
from pyspark.sql import SparkSession

spark = SparkSession\
        .builder\
        .appName("Spark_DataFrames")\
        .getOrCreate()

* The first line of the file holds the column names, so .option("header", True) makes Spark use it as the DataFrame header instead of treating it as data

In [37]:
df = spark.read.option("header", True).csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB| 02984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

## Schema de um DataFrame

* option("inferSchema", True) ⇒ automatically infers the data type of each column in the DF

* if the file uses a delimiter other than a comma (,), for example a tab (TSV, tab-separated), set it together with the other options, e.g. .options(delimiter='\t') or .option("delimiter", "\t")
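
To make the two options above concrete without Spark, the sketch below uses Python's standard csv module on a small, made-up tab-separated sample (the sample text and the infer helper are hypothetical and only for illustration). It shows that a reader hands back every field as a string until something converts it, and that converting a value such as 02984 to an integer drops the leading zero. Spark's own inference rules are more elaborate than this toy version.

```python
import csv
import io

# Hypothetical tab-separated sample; not taken from StudentData.csv.
sample = "age\troll\tname\n28\t02984\tAna\n29\t12899\tBruno\n"

# delimiter="\t" plays the role of the delimiter option for a TSV file.
reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)      # first line holds the column names
raw_rows = list(reader)    # every field is still a string here


def infer(value):
    """Toy type inference: use int when the text parses as one, else keep str."""
    try:
        return int(value)
    except ValueError:
        return value


typed_rows = [[infer(v) for v in row] for row in raw_rows]

print(header)
print(raw_rows[0])    # roll keeps its leading zero as text
print(typed_rows[0])  # roll becomes an integer, so the zero is gone
```

This is the same effect seen in this notebook: roll is printed as 02984 when the file is read without inferSchema and as 2984 once the schema is inferred.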

In [23]:
df = spark.read.options(inferSchema='True', header = 'True', delimiter = ',').csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')

In [24]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- roll: integer (nullable = true)
 |-- marks: integer (nullable = true)
 |-- email: string (nullable = true)



In [25]:
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

## Providing the DataFrame Schema

In [26]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
                     StructField("age", IntegerType(), True),
                     StructField("gender", StringType(), True),
                     StructField("name", StringType(), True),
                     StructField("course", StringType(), True),
                     StructField("roll", StringType(), True),
                     StructField("marks", IntegerType(), True),
                     StructField("email", StringType(), True)
])

In [29]:
df = spark.read.options(header = 'True').schema(schema).csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- roll: string (nullable = true)
 |-- marks: integer (nullable = true)
 |-- email: string (nullable = true)



## Creating a DF from an RDD

In [45]:
from pyspark.sql import SparkSession
spark = SparkSession\
        .builder\
        .appName("Spark DataFrame")\
        .getOrCreate()

In [52]:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("RDD")
sc = SparkContext.getOrCreate(conf=conf)

rdd = sc.textFile('/content/drive/MyDrive/PySpark/data/StudentData.csv')
headers = rdd.first()

# drop the header line and split each remaining row on commas
rdd = rdd.filter(lambda x: x != headers).map(lambda x: x.split(','))
rdd = rdd.map(lambda x: [int(x[0]), x[1], x[2], x[3], x[4], int(x[5]), x[6]])
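
The filter and map steps above can be traced on a plain Python list, which may help when checking the lambdas before running them on an RDD. The sample lines and the parse_line helper below are made up for illustration; only the column order follows StudentData.csv.

```python
# Hypothetical lines standing in for what sc.textFile would return.
lines = [
    "age,gender,name,course,roll,marks,email",
    "28,Female,Ana Silva,DB,02984,59,ana@example.com",
    "29,Male,Bruno Costa,Cloud,12899,62,bruno@example.com",
]

headers = lines[0]  # same role as rdd.first()


def parse_line(line):
    """Split one CSV line and cast age (index 0) and marks (index 5) to int."""
    x = line.split(",")
    return [int(x[0]), x[1], x[2], x[3], x[4], int(x[5]), x[6]]


# Same order as the RDD pipeline: drop the header, then parse each line.
rows = [parse_line(line) for line in lines if line != headers]
columns = headers.split(",")

print(columns)
for row in rows:
    print(row)
```

Note that a plain split(',') breaks on fields that contain quoted commas. The CSV reader behind spark.read handles quoting, whereas this line-by-line approach does not.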

In [53]:
columns = headers.split(",")
dfRdd = rdd.toDF(columns)
dfRdd.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB| 02984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

In [54]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
                     StructField("age", IntegerType(), True),
                     StructField("gender", StringType(), True),
                     StructField("name", StringType(), True),
                     StructField("course", StringType(), True),
                     StructField("roll", StringType(), True),
                     StructField("marks", IntegerType(), True),
                     StructField("email", StringType(), True)
])

In [55]:
dfRdd2 = spark.createDataFrame(rdd, schema=schema)
dfRdd2.show()
dfRdd2.printSchema()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB| 02984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

## Selecting DataFrame Columns

In [56]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()


In [58]:
df = spark.read.options(header = 'True', inferSchema = 'True').csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

* How to select columns in the DF

In [61]:
df.select("name","gender").show()

+----------------+------+
|            name|gender|
+----------------+------+
| Hubert Oliveras|Female|
|Toshiko Hillyard|Female|
|  Celeste Lollis|  Male|
|    Elenore Choy|Female|
|  Sheryll Towler|  Male|
|  Margene Moores|  Male|
|     Neda Briski|  Male|
|    Claude Panos|Female|
|  Celeste Lollis|  Male|
|  Cordie Harnois|  Male|
|       Kena Wild|Female|
| Ernest Rossbach|  Male|
|  Latia Vanhoose|Female|
|  Latia Vanhoose|Female|
|     Neda Briski|  Male|
|  Latia Vanhoose|Female|
|  Loris Crossett|  Male|
|  Annika Hoffman|  Male|
|   Santa Kerfien|  Male|
|Mickey Cortright|Female|
+----------------+------+
only showing top 20 rows



In [62]:
df.select(df.name, df.email).show()

+----------------+--------------------+
|            name|               email|
+----------------+--------------------+
| Hubert Oliveras|Annika Hoffman_Na...|
|Toshiko Hillyard|Margene Moores_Ma...|
|  Celeste Lollis|Jeannetta Golden_...|
|    Elenore Choy|Billi Clore_Mitzi...|
|  Sheryll Towler|Claude Panos_Judi...|
|  Margene Moores|Toshiko Hillyard_...|
|     Neda Briski|Alberta Freund_El...|
|    Claude Panos|Sheryll Towler_Al...|
|  Celeste Lollis|Nicole Harwood_Cl...|
|  Cordie Harnois|Judie Chipps_Clem...|
|       Kena Wild|Dustin Feagins_Ma...|
| Ernest Rossbach|Maybell Duguay_Ab...|
|  Latia Vanhoose|Latia Vanhoose_Mi...|
|  Latia Vanhoose|Eda Neathery_Nico...|
|     Neda Briski|Margene Moores_Mi...|
|  Latia Vanhoose|Claude Panos_Sant...|
|  Loris Crossett|Mitzi Seldon_Jenn...|
|  Annika Hoffman|Taryn Brownlee_Mi...|
|   Santa Kerfien|Judie Chipps_Tary...|
|Mickey Cortright|Ernest Rossbach_M...|
+----------------+--------------------+
only showing top 20 rows



In [63]:
from pyspark.sql.functions import col

df.select(col("roll"), col("name")).show()

+------+----------------+
|  roll|            name|
+------+----------------+
|  2984| Hubert Oliveras|
| 12899|Toshiko Hillyard|
| 21267|  Celeste Lollis|
| 32877|    Elenore Choy|
| 41487|  Sheryll Towler|
| 52771|  Margene Moores|
| 61973|     Neda Briski|
| 72409|    Claude Panos|
| 81492|  Celeste Lollis|
| 92882|  Cordie Harnois|
|102285|       Kena Wild|
|111449| Ernest Rossbach|
|122502|  Latia Vanhoose|
|132110|  Latia Vanhoose|
|141770|     Neda Briski|
|152159|  Latia Vanhoose|
|161771|  Loris Crossett|
|171660|  Annika Hoffman|
|182129|   Santa Kerfien|
|192537|Mickey Cortright|
+------+----------------+
only showing top 20 rows



In [64]:
df.select('*').show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

In [68]:
df.columns
df.select('age', 'gender', 'email').show()

+---+------+--------------------+
|age|gender|               email|
+---+------+--------------------+
| 28|Female|Annika Hoffman_Na...|
| 29|Female|Margene Moores_Ma...|
| 28|  Male|Jeannetta Golden_...|
| 29|Female|Billi Clore_Mitzi...|
| 28|  Male|Claude Panos_Judi...|
| 28|  Male|Toshiko Hillyard_...|
| 28|  Male|Alberta Freund_El...|
| 28|Female|Sheryll Towler_Al...|
| 28|  Male|Nicole Harwood_Cl...|
| 29|  Male|Judie Chipps_Clem...|
| 29|Female|Dustin Feagins_Ma...|
| 29|  Male|Maybell Duguay_Ab...|
| 28|Female|Latia Vanhoose_Mi...|
| 29|Female|Eda Neathery_Nico...|
| 29|  Male|Margene Moores_Mi...|
| 29|Female|Claude Panos_Sant...|
| 29|  Male|Mitzi Seldon_Jenn...|
| 29|  Male|Taryn Brownlee_Mi...|
| 29|  Male|Judie Chipps_Tary...|
| 28|Female|Ernest Rossbach_M...|
+---+------+--------------------+
only showing top 20 rows



In [73]:
df.select(df.columns[2:6]).show()

+----------------+------+------+-----+
|            name|course|  roll|marks|
+----------------+------+------+-----+
| Hubert Oliveras|    DB|  2984|   59|
|Toshiko Hillyard| Cloud| 12899|   62|
|  Celeste Lollis|    PF| 21267|   45|
|    Elenore Choy|    DB| 32877|   29|
|  Sheryll Towler|   DSA| 41487|   41|
|  Margene Moores|   MVC| 52771|   32|
|     Neda Briski|   OOP| 61973|   69|
|    Claude Panos| Cloud| 72409|   85|
|  Celeste Lollis|   MVC| 81492|   64|
|  Cordie Harnois|   OOP| 92882|   51|
|       Kena Wild|   DSA|102285|   35|
| Ernest Rossbach|    DB|111449|   53|
|  Latia Vanhoose|    DB|122502|   27|
|  Latia Vanhoose|   MVC|132110|   55|
|     Neda Briski|    PF|141770|   42|
|  Latia Vanhoose|    DB|152159|   27|
|  Loris Crossett|   MVC|161771|   36|
|  Annika Hoffman|   OOP|171660|   22|
|   Santa Kerfien|    PF|182129|   56|
|Mickey Cortright|    DB|192537|   62|
+----------------+------+------+-----+
only showing top 20 rows
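
Because df.columns is an ordinary Python list of column names, the slice passed to select above follows normal list slicing rules. A minimal sketch, with the column list written out by hand from the schema printed earlier in this notebook:

```python
# Column names as printed by df.printSchema() earlier in this notebook.
columns = ["age", "gender", "name", "course", "roll", "marks", "email"]

# Start index 2 is included, stop index 6 is excluded.
selected = columns[2:6]
print(selected)  # ['name', 'course', 'roll', 'marks']
```

These are the four columns shown in the output above.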



In [77]:
df2 = df.select(col("roll"), col("name"))

In [78]:
df2.show()

+------+----------------+
|  roll|            name|
+------+----------------+
|  2984| Hubert Oliveras|
| 12899|Toshiko Hillyard|
| 21267|  Celeste Lollis|
| 32877|    Elenore Choy|
| 41487|  Sheryll Towler|
| 52771|  Margene Moores|
| 61973|     Neda Briski|
| 72409|    Claude Panos|
| 81492|  Celeste Lollis|
| 92882|  Cordie Harnois|
|102285|       Kena Wild|
|111449| Ernest Rossbach|
|122502|  Latia Vanhoose|
|132110|  Latia Vanhoose|
|141770|     Neda Briski|
|152159|  Latia Vanhoose|
|161771|  Loris Crossett|
|171660|  Annika Hoffman|
|182129|   Santa Kerfien|
|192537|Mickey Cortright|
+------+----------------+
only showing top 20 rows



## withColumn on a DataFrame

In [79]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()
df = spark.read.options(header = 'True', inferSchema = 'True').csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')

In [80]:
df = spark.read.options(header = 'True', inferSchema = 'True').csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

In [101]:
from pyspark.sql.functions import col, lit
df = df.withColumn("roll", col("roll").cast("string"))

In [94]:
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

In [95]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- roll: string (nullable = true)
 |-- marks: integer (nullable = true)
 |-- email: string (nullable = true)



In [98]:
# Add 10 to the marks column
df = df.withColumn("marks", col('marks') + 10)
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   69|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   72|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   55|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   39|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   51|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   42|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   79|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   95|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   74|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   61|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   45|Dustin Feagins_Ma...|
| 29| 

In [108]:
# Create a new column
df = df.withColumn("aggregated marks", col('marks') - 10)
df.show()

+---+------+----------------+------+------+-----+--------------------+----------------+-------+
|age|gender|            name|course|  roll|marks|               email|aggregated marks|Country|
+---+------+----------------+------+------+-----+--------------------+----------------+-------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|              49|    USA|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|              52|    USA|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|              35|    USA|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|              19|    USA|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|              31|    USA|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|              22|    USA|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|              59|    USA|
| 28|Female|    Claude Panos| Cloud| 724

In [104]:
from pyspark.sql.functions import col, lit
df = df.withColumn("Country", lit("USA"))
df.show()

+---+------+----------------+------+------+-----+--------------------+----------------+-------+
|age|gender|            name|course|  roll|marks|               email|aggregated marks|Country|
+---+------+----------------+------+------+-----+--------------------+----------------+-------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|              49|    USA|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|              52|    USA|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|              35|    USA|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|              19|    USA|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|              31|    USA|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|              22|    USA|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|              59|    USA|
| 28|Female|    Claude Panos| Cloud| 724

In [110]:
(df.withColumn("marks", col("marks") - 10)
   .withColumn("updated marks", col("marks") + 20)
   .withColumn("Country", lit("USA"))
   .show())

+---+------+----------------+------+------+-----+--------------------+----------------+-------+-------------+
|age|gender|            name|course|  roll|marks|               email|aggregated marks|Country|updated marks|
+---+------+----------------+------+------+-----+--------------------+----------------+-------+-------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   49|Annika Hoffman_Na...|              49|    USA|           69|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   52|Margene Moores_Ma...|              52|    USA|           72|
| 28|  Male|  Celeste Lollis|    PF| 21267|   35|Jeannetta Golden_...|              35|    USA|           55|
| 29|Female|    Elenore Choy|    DB| 32877|   19|Billi Clore_Mitzi...|              19|    USA|           39|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   31|Claude Panos_Judi...|              31|    USA|           51|
| 28|  Male|  Margene Moores|   MVC| 52771|   22|Toshiko Hillyard_...|              22|    USA|           42|
| 28|  Mal
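
In the chained call above, each withColumn works on the result of the previous one, so updated marks is computed from the already reduced marks value. The plain-Python sketch below mimics that order on a single row stored as a dict. The with_column helper is hypothetical and only imitates the idea; it is not Spark's API.

```python
def with_column(row, name, fn):
    """Return a new dict with column `name` set to fn(row); the input is unchanged."""
    new_row = dict(row)
    new_row[name] = fn(row)
    return new_row


row = {"name": "Hubert Oliveras", "marks": 59}

# Each step receives the output of the step before it.
result = with_column(row, "marks", lambda r: r["marks"] - 10)
result = with_column(result, "updated marks", lambda r: r["marks"] + 20)
result = with_column(result, "Country", lambda r: "USA")

print(result)
print(row)  # the original row still has marks == 59
```

The values match the first row of the output above (marks 49, updated marks 69). As in the sketch, withColumn returns a new DataFrame, so df itself is unchanged unless the result is assigned back to it.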

### Spark DF withColumnRenamed and alias

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()
df = spark.read.options(header = 'True', inferSchema = 'True').csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

In [7]:
# rename columns
df = df.withColumnRenamed("gender", "sex").withColumnRenamed("roll", "Roll number")
df.show()

+---+------+----------------+------+-----------+-----+--------------------+
|age|   sex|            name|course|Roll number|marks|               email|
+---+------+----------------+------+-----------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|       2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud|      12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF|      21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB|      32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA|      41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC|      52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP|      61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud|      72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC|      81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP|      92882|   51|Judie Chipps_Clem...|
| 29|Female|

* Using alias

In [9]:
from pyspark.sql.functions import col, lit
df.select(col("name").alias("Full name")).show()

+----------------+
|       Full name|
+----------------+
| Hubert Oliveras|
|Toshiko Hillyard|
|  Celeste Lollis|
|    Elenore Choy|
|  Sheryll Towler|
|  Margene Moores|
|     Neda Briski|
|    Claude Panos|
|  Celeste Lollis|
|  Cordie Harnois|
|       Kena Wild|
| Ernest Rossbach|
|  Latia Vanhoose|
|  Latia Vanhoose|
|     Neda Briski|
|  Latia Vanhoose|
|  Loris Crossett|
|  Annika Hoffman|
|   Santa Kerfien|
|Mickey Cortright|
+----------------+
only showing top 20 rows



## withColumnRenamed on a DataFrame

In [10]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()
df = spark.read.options(header = 'True', inferSchema = 'True').csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

In [12]:
df = df.withColumnRenamed("gender", "sex").withColumnRenamed("roll", "roll number")
df.show()

+---+------+----------------+------+-----------+-----+--------------------+
|age|   sex|            name|course|roll number|marks|               email|
+---+------+----------------+------+-----------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|       2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud|      12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF|      21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB|      32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA|      41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC|      52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP|      61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud|      72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC|      81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP|      92882|   51|Judie Chipps_Clem...|
| 29|Female|

In [14]:
df.select(col("name").alias("Full Name")).show()

+----------------+
|       Full Name|
+----------------+
| Hubert Oliveras|
|Toshiko Hillyard|
|  Celeste Lollis|
|    Elenore Choy|
|  Sheryll Towler|
|  Margene Moores|
|     Neda Briski|
|    Claude Panos|
|  Celeste Lollis|
|  Cordie Harnois|
|       Kena Wild|
| Ernest Rossbach|
|  Latia Vanhoose|
|  Latia Vanhoose|
|     Neda Briski|
|  Latia Vanhoose|
|  Loris Crossett|
|  Annika Hoffman|
|   Santa Kerfien|
|Mickey Cortright|
+----------------+
only showing top 20 rows



## filter / where on a DataFrame

In [15]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()
df = spark.read.options(header = 'True', inferSchema = 'True').csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

In [17]:
df.filter(df.course == "DB").show()

+---+------+-----------------+------+-------+-----+--------------------+
|age|gender|             name|course|   roll|marks|               email|
+---+------+-----------------+------+-------+-----+--------------------+
| 28|Female|  Hubert Oliveras|    DB|   2984|   59|Annika Hoffman_Na...|
| 29|Female|     Elenore Choy|    DB|  32877|   29|Billi Clore_Mitzi...|
| 29|  Male|  Ernest Rossbach|    DB| 111449|   53|Maybell Duguay_Ab...|
| 28|Female|   Latia Vanhoose|    DB| 122502|   27|Latia Vanhoose_Mi...|
| 29|Female|   Latia Vanhoose|    DB| 152159|   27|Claude Panos_Sant...|
| 28|Female| Mickey Cortright|    DB| 192537|   62|Ernest Rossbach_M...|
| 28|Female|      Anna Santos|    DB| 311589|   79|Celeste Lollis_Mi...|
| 28|  Male|    Kizzy Brenner|    DB| 381712|   36|Paris Hutton_Kena...|
| 28|  Male| Toshiko Hillyard|    DB| 392218|   47|Leontine Phillips...|
| 29|  Male|     Paris Hutton|    DB| 481229|   57|Clementina Menke_...|
| 28|Female| Mickey Cortright|    DB| 551389|   43|

In [18]:
df.filter(col("course") == "DB").show()

+---+------+-----------------+------+-------+-----+--------------------+
|age|gender|             name|course|   roll|marks|               email|
+---+------+-----------------+------+-------+-----+--------------------+
| 28|Female|  Hubert Oliveras|    DB|   2984|   59|Annika Hoffman_Na...|
| 29|Female|     Elenore Choy|    DB|  32877|   29|Billi Clore_Mitzi...|
| 29|  Male|  Ernest Rossbach|    DB| 111449|   53|Maybell Duguay_Ab...|
| 28|Female|   Latia Vanhoose|    DB| 122502|   27|Latia Vanhoose_Mi...|
| 29|Female|   Latia Vanhoose|    DB| 152159|   27|Claude Panos_Sant...|
| 28|Female| Mickey Cortright|    DB| 192537|   62|Ernest Rossbach_M...|
| 28|Female|      Anna Santos|    DB| 311589|   79|Celeste Lollis_Mi...|
| 28|  Male|    Kizzy Brenner|    DB| 381712|   36|Paris Hutton_Kena...|
| 28|  Male| Toshiko Hillyard|    DB| 392218|   47|Leontine Phillips...|
| 29|  Male|     Paris Hutton|    DB| 481229|   57|Clementina Menke_...|
| 28|Female| Mickey Cortright|    DB| 551389|   43|

In [20]:
# Multiple conditions (each wrapped in parentheses, combined with &)
df.filter( (df.course == "DB") & (df.marks > 50) ).show()

+---+------+------------------+------+-------+-----+--------------------+
|age|gender|              name|course|   roll|marks|               email|
+---+------+------------------+------+-------+-----+--------------------+
| 28|Female|   Hubert Oliveras|    DB|   2984|   59|Annika Hoffman_Na...|
| 29|  Male|   Ernest Rossbach|    DB| 111449|   53|Maybell Duguay_Ab...|
| 28|Female|  Mickey Cortright|    DB| 192537|   62|Ernest Rossbach_M...|
| 28|Female|       Anna Santos|    DB| 311589|   79|Celeste Lollis_Mi...|
| 29|  Male|      Paris Hutton|    DB| 481229|   57|Clementina Menke_...|
| 28|Female|   Hubert Oliveras|    DB| 771081|   79|Kizzy Brenner_Dus...|
| 29|Female|      Elenore Choy|    DB| 811824|   55|Maybell Duguay_Me...|
| 29|  Male|  Clementina Menke|    DB| 882200|   76|Michelle Ruggiero...|
| 29|Female|   Sebrina Maresca|    DB| 922210|   54|Toshiko Hillyard_...|
| 29|  Male|      Naoma Fritts|    DB| 931295|   79|Hubert Oliveras_S...|
| 29|Female|      Claude Panos|    DB|
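
The parentheses around each condition in the cell above matter: in Python, `&` binds more tightly than `==` and `>`. The following plain-Python sketch (no Spark needed) uses the standard `ast` module to show how the two spellings are parsed. The variable names `without_parens` and `with_parens` are illustrative only.

```python
import ast

# Without parentheses, Python groups this as: course == ("DB" & marks) > 50
without_parens = ast.parse('course == "DB" & marks > 50', mode="eval").body

# With parentheses, the two comparisons are evaluated first, then combined by &
with_parens = ast.parse('(course == "DB") & (marks > 50)', mode="eval").body

# A chained comparison whose middle operand is the bitwise AND
print(type(without_parens).__name__, len(without_parens.ops))

# A bitwise AND whose two operands are comparisons
print(type(with_parens).__name__, type(with_parens.op).__name__)
```

The first form is a single chained comparison, which is not the intended filter; the second is the `&` of two comparisons, which is what `df.filter` expects.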

In [21]:
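# Note: "OPP" is presumably a typo for "OOP"; as written it matches no course, so no OOP rows appear below
courses = ["DB", "Cloud", "OPP", "DSA"]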
df.filter( df.course.isin(courses) ).show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29|  Male| Ernest Rossbach|    DB|111449|   53|Maybell Duguay_Ab...|
| 28|Female|  Latia Vanhoose|    DB|122502|   27|Latia Vanhoose_Mi...|
| 29|Female|  Latia Vanhoose|    DB|152159|   27|Claude Panos_Sant...|
| 28|Female|Mickey Cortright|    DB|192537|   62|Ernest Rossbach_M...|
| 28|Female|       Kena Wild| Cloud|221750|   60|Mitzi Seldon_Jenn...|
| 28|F

In [22]:
# Starts with the given string
df.filter( df.course.startswith("D") ).show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29|  Male| Ernest Rossbach|    DB|111449|   53|Maybell Duguay_Ab...|
| 28|Female|  Latia Vanhoose|    DB|122502|   27|Latia Vanhoose_Mi...|
| 29|Female|  Latia Vanhoose|    DB|152159|   27|Claude Panos_Sant...|
| 28|Female|Mickey Cortright|    DB|192537|   62|Ernest Rossbach_M...|
| 28|Female|    Jc Andrepont|   DSA|232060|   58|Billi Clore_Abram...|
| 29|Female|    Paris Hutton|   DSA|271472|   99|Sheryll Towler_Al...|
| 28|Female|  Dustin Feagins|   DSA|291984|   82|Abram Nagao_Kena ...|
| 28|F

In [24]:
# Ends with the given string
df.filter( df.name.endswith("se") ).show()

+---+------+--------------+------+-------+-----+--------------------+
|age|gender|          name|course|   roll|marks|               email|
+---+------+--------------+------+-------+-----+--------------------+
| 28|Female|Latia Vanhoose|    DB| 122502|   27|Latia Vanhoose_Mi...|
| 29|Female|Latia Vanhoose|   MVC| 132110|   55|Eda Neathery_Nico...|
| 29|Female|Latia Vanhoose|    DB| 152159|   27|Claude Panos_Sant...|
| 29|  Male|Latia Vanhoose| Cloud|1832268|   60|Marylee Capasso_S...|
| 29|  Male|Latia Vanhoose|   OOP|2372748|   94|Latia Vanhoose_La...|
| 29|Female|Latia Vanhoose|    PF|2861854|   42|Claude Panos_Nico...|
| 29|  Male|Latia Vanhoose|   MVC|2992281|   90|Elenore Choy_Cord...|
| 29|Female|Latia Vanhoose|   MVC|3091650|   30|Cordie Harnois_Se...|
| 29|Female|Latia Vanhoose|   OOP|3841395|   26|Kizzy Brenner_Eda...|
| 29|  Male|Latia Vanhoose| Cloud|4661276|   40|Jc Andrepont_Anni...|
| 28|Female|Latia Vanhoose|   OOP|4792828|   72|Tamera Blakley_Mi...|
| 28|Female|Latia Va

In [25]:
# Contains the given string
df.filter( df.name.contains("se") ).show()

+---+------+--------------+------+-------+-----+--------------------+
|age|gender|          name|course|   roll|marks|               email|
+---+------+--------------+------+-------+-----+--------------------+
| 28|Female|Latia Vanhoose|    DB| 122502|   27|Latia Vanhoose_Mi...|
| 29|Female|Latia Vanhoose|   MVC| 132110|   55|Eda Neathery_Nico...|
| 29|Female|Latia Vanhoose|    DB| 152159|   27|Claude Panos_Sant...|
| 29|  Male|Loris Crossett|   MVC| 161771|   36|Mitzi Seldon_Jenn...|
| 29|Female|Loris Crossett|    PF| 201487|   96|Elenore Choy_Lati...|
| 28|Female|Loris Crossett|    PF| 332739|   62|Michelle Ruggiero...|
| 29|  Male|Loris Crossett|    PF| 911593|   46|Gonzalo Ferebee_M...|
| 28|Female|Loris Crossett|   DSA|1662549|   86|Paris Hutton_Lati...|
| 29|  Male|Latia Vanhoose| Cloud|1832268|   60|Marylee Capasso_S...|
| 29|  Male|Latia Vanhoose|   OOP|2372748|   94|Latia Vanhoose_La...|
| 28|Female|Loris Crossett|   OOP|2691881|   29|Maybell Duguay_Ni...|
| 28|  Male|Loris Cr

In [26]:
# Using SQL LIKE patterns; see the SQL documentation for more patterns
df.filter( df.name.like('%se%') ).show()

+---+------+--------------+------+-------+-----+--------------------+
|age|gender|          name|course|   roll|marks|               email|
+---+------+--------------+------+-------+-----+--------------------+
| 28|Female|Latia Vanhoose|    DB| 122502|   27|Latia Vanhoose_Mi...|
| 29|Female|Latia Vanhoose|   MVC| 132110|   55|Eda Neathery_Nico...|
| 29|Female|Latia Vanhoose|    DB| 152159|   27|Claude Panos_Sant...|
| 29|  Male|Loris Crossett|   MVC| 161771|   36|Mitzi Seldon_Jenn...|
| 29|Female|Loris Crossett|    PF| 201487|   96|Elenore Choy_Lati...|
| 28|Female|Loris Crossett|    PF| 332739|   62|Michelle Ruggiero...|
| 29|  Male|Loris Crossett|    PF| 911593|   46|Gonzalo Ferebee_M...|
| 28|Female|Loris Crossett|   DSA|1662549|   86|Paris Hutton_Lati...|
| 29|  Male|Latia Vanhoose| Cloud|1832268|   60|Marylee Capasso_S...|
| 29|  Male|Latia Vanhoose|   OOP|2372748|   94|Latia Vanhoose_La...|
| 28|Female|Loris Crossett|   OOP|2691881|   29|Maybell Duguay_Ni...|
| 28|  Male|Loris Cr
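
The `startswith`, `endswith`, and `contains` filters above can each be written as a `LIKE` pattern, where `%` matches any run of characters and `_` matches exactly one. The sketch below is a plain-Python illustration of those wildcard semantics, not Spark's implementation. The helpers `like_to_regex` and `like_match` are hypothetical, and escape characters are deliberately ignored.

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regex (no escape-character support)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches zero or more characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def like_match(value, pattern):
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None

# 'D%'   behaves like startswith("D")
# '%se'  behaves like endswith("se")
# '%se%' behaves like contains("se")
starts = like_match("DSA", "D%")
ends = like_match("Latia Vanhoose", "%se")
contains = like_match("Loris Crossett", "%se%")
not_ends = like_match("Loris Crossett", "%se")
```

This is consistent with the outputs above: "Loris Crossett" appears in the `contains` and `like('%se%')` results but not in the `endswith("se")` result.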

### Example

* **Using the StudentData.csv dataset**

* Read the file into a DF

* Create a new column in the DF for the total marks, and set the total marks to 120

* Create a new column with the average marks: $\frac{\text{marks}}{\text{total marks}} \times 100$

* Filter all students who scored more than 80% in the OOP course and save them to a new DF

* Filter all students who scored more than 60% in the Cloud course and save them to a new DF

* Print the names and marks of all students in the DFs above
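
As a quick sanity check of the percentage formula and the two thresholds, here is a plain-Python sketch (no Spark). The helper `percentage` and the constant names are hypothetical; the total of 120 comes from the exercise statement.

```python
TOTAL_MARKS = 120  # total marks fixed by the exercise

def percentage(marks, total=TOTAL_MARKS):
    """Return marks as a percentage of the total marks."""
    return marks / total * 100

# Raw marks that correspond to exactly 80% and 60% of the total;
# a student must score strictly more than these to pass each filter.
oop_cutoff = TOTAL_MARKS * 80 / 100
cloud_cutoff = TOTAL_MARKS * 60 / 100

print(percentage(45))            # 45 out of 120
print(oop_cutoff, cloud_cutoff)  # marks needed for 80% and 60%
```

With a total of 120, "more than 80%" means more than 96 marks and "more than 60%" means more than 72 marks.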

In [96]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()
df = spark.read.options(header = 'True', inferSchema = 'True').csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')

In [97]:
# Add a constant column holding the total marks (numeric literal rather than a string)
df2 = df.withColumn('total_marks', lit(120))

In [98]:
# Marks as a percentage of the total marks
df2 = df2.withColumn('average', col("marks") / col("total_marks") * 100)
df2.show()

+---+------+----------------+------+------+-----+--------------------+-----------+------------------+
|age|gender|            name|course|  roll|marks|               email|total_marks|           average|
+---+------+----------------+------+------+-----+--------------------+-----------+------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|        120|49.166666666666664|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|        120| 51.66666666666667|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|        120|              37.5|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|        120|24.166666666666668|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|        120|34.166666666666664|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|        120|26.666666666666668|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|        120|

In [99]:
df_OOP = df2.filter( (df2.course == "OOP") & (df2.average > 80) )
df_OOP.show()

+---+------+------------------+------+-------+-----+--------------------+-----------+-----------------+
|age|gender|              name|course|   roll|marks|               email|total_marks|          average|
+---+------+------------------+------+-------+-----+--------------------+-----------+-----------------+
| 28|  Male|    Jenna Montague|   OOP|3331161|   98|Leontine Phillips...|        120|81.66666666666667|
| 29|Female|Priscila Tavernier|   OOP|3902993|   99|Celeste Lollis_Bi...|        120|             82.5|
| 28|Female|      Judie Chipps|   OOP|5451977|   99|Tamera Blakley_Mi...|        120|             82.5|
| 29|  Male|    Margene Moores|   OOP|5621072|   97|Sheryll Towler_Ma...|        120|80.83333333333333|
| 29|  Male|      Jc Andrepont|   OOP|8022618|   97|Cordie Harnois_Ja...|        120|80.83333333333333|
| 28|  Male|    Loris Crossett|   OOP|8172914|   98|Paris Hutton_Pari...|        120|81.66666666666667|
| 28|  Male|    Loris Crossett|   OOP|9692316|   99|Judie Chipps

In [100]:
df_CLOUD = df2.filter( (df2.course == "Cloud") & (df2.average > 60) )
df_CLOUD.show()

+---+------+-----------------+------+-------+-----+--------------------+-----------+-----------------+
|age|gender|             name|course|   roll|marks|               email|total_marks|          average|
+---+------+-----------------+------+-------+-----+--------------------+-----------+-----------------+
| 28|Female|     Claude Panos| Cloud|  72409|   85|Sheryll Towler_Al...|        120|70.83333333333334|
| 29|  Male|      Billi Clore| Cloud| 512047|   76|Taryn Brownlee_Ju...|        120|63.33333333333333|
| 28|Female|   Somer Stoecker| Cloud| 612490|   82|Sebrina Maresca_G...|        120|68.33333333333333|
| 29|Female|     Judie Chipps| Cloud| 632793|   75|Tijuana Kropf_Ele...|        120|             62.5|
| 29|Female|     Eda Neathery| Cloud|1011971|   91|Margene Moores_El...|        120|75.83333333333333|
| 28|  Male|   Bonita Higuera| Cloud|1312294|   94|Eda Neathery_Pris...|        120|78.33333333333333|
| 29|Female|  Hubert Oliveras| Cloud|1392791|   94|Anna Santos_Alber...| 

In [101]:
df_OOP.select("name", "marks").show()

+------------------+-----+
|              name|marks|
+------------------+-----+
|    Jenna Montague|   98|
|Priscila Tavernier|   99|
|      Judie Chipps|   99|
|    Margene Moores|   97|
|      Jc Andrepont|   97|
|    Loris Crossett|   98|
|    Loris Crossett|   99|
+------------------+-----+



In [102]:
df_CLOUD.select("name", "marks").show()

+-----------------+-----+
|             name|marks|
+-----------------+-----+
|     Claude Panos|   85|
|      Billi Clore|   76|
|   Somer Stoecker|   82|
|     Judie Chipps|   75|
|     Eda Neathery|   91|
|   Bonita Higuera|   94|
|  Hubert Oliveras|   94|
|      Neda Briski|   74|
|   Melani Engberg|   99|
|     Paris Hutton|   79|
|     Eda Neathery|   95|
|      Neda Briski|   81|
|    Tijuana Kropf|   78|
|   Jenna Montague|   96|
|   Dustin Feagins|   89|
|  Ernest Rossbach|   83|
|Leontine Phillips|   76|
|  Sebrina Maresca|   97|
| Clementina Menke|   95|
|    Kizzy Brenner|   80|
+-----------------+-----+
only showing top 20 rows

