<a href="https://colab.research.google.com/github/gabgovar/Apache-Spark/blob/main/Spark_DataFrames.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spark DataFrame

## Mounting Google Drive in Google Colab

In [13]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Installing the Hadoop Spark Dependencies in Google Colab

In [1]:
# install the dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

## Configuring the Hadoop Spark Dependencies in Google Colab

In [11]:
# set the environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

# make pyspark importable
import findspark
findspark.init('spark-2.4.4-bin-hadoop2.7')
findspark.find()

'spark-2.4.4-bin-hadoop2.7/python/pyspark'

## Reading a DataFrame with Spark

In [12]:
from pyspark.sql import SparkSession

spark = SparkSession\
        .builder\
        .appName("Spark_DataFrames")\
        .getOrCreate()

* The first line of the file holds the column names, so .option("header", True) makes Spark use it as the DF header

In [16]:
df = spark.read.option("header", True).csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB| 02984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

## DataFrame Schema

* option("inferSchema", True) ⇒ automatically infers the data type of each column in the DF

* if the file uses a delimiter other than a comma (,), for example a tab (TSV, tab-separated), set it on the reader with .option("delimiter", "\t"), or pass delimiter = '\t' inside options

In [23]:
df = spark.read.options(inferSchema='True', header = 'True', delimiter = ',').csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')

In [24]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- roll: integer (nullable = true)
 |-- marks: integer (nullable = true)
 |-- email: string (nullable = true)
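
The effect of schema inference on a column such as roll can be illustrated with a small pure-Python sketch. This is only a simplified stand-in and not Spark's actual inference algorithm: the helper names `infer_column_type` and `cast_column` are hypothetical, and the sample values are taken from the table shown earlier. It shows why a value read as the string 02984 turns into the integer 2984 once the column is treated as an integer.

```python
def infer_column_type(values):
    """Return "integer" if every value parses as an int, otherwise "string".

    Simplified illustration only; Spark's real inference handles many more types.
    """
    try:
        for value in values:
            int(value)
    except ValueError:
        return "string"
    return "integer"


def cast_column(values, type_name):
    """Convert the raw strings according to the inferred type name."""
    if type_name == "integer":
        return [int(value) for value in values]
    return list(values)


# Raw values as they appear in the CSV text (everything starts as a string).
roll = ["02984", "12899", "21267"]
gender = ["Female", "Female", "Male"]

roll_type = infer_column_type(roll)      # all values parse as integers
gender_type = infer_column_type(gender)  # "Female" is not an integer

roll_as_int = cast_column(roll, roll_type)
print(roll_type, gender_type, roll_as_int)
```

If the leading zero in roll matters, keeping the column as a string (for example with an explicit schema) avoids this conversion.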



In [25]:
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

## Providing the DataFrame Schema

In [26]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
                     StructField("age", IntegerType(), True),
                     StructField("gender", StringType(), True),
                     StructField("name", StringType(), True),
                     StructField("course", StringType(), True),
                     StructField("roll", StringType(), True),
                     StructField("marks", IntegerType(), True),
                     StructField("email", StringType(), True)
])

In [29]:
df = spark.read.options(header = 'True').schema(schema).csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- roll: string (nullable = true)
 |-- marks: integer (nullable = true)
 |-- email: string (nullable = true)



## Creating a DF from an RDD

In [45]:
from pyspark.sql import SparkSession
spark = SparkSession\
        .builder\
        .appName("Spark DataFrame")\
        .getOrCreate()

In [52]:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("RDD")
sc = SparkContext.getOrCreate(conf=conf)

rdd = sc.textFile('/content/drive/MyDrive/PySpark/data/StudentData.csv')
headers = rdd.first()

rdd = rdd.filter(lambda x: x!= headers).map(lambda x: x.split(','))
rdd = rdd.map(lambda x: [int(x[0]), x[1], x[2], x[3], x[4], int(x[5]), x[6]])
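
The two RDD steps above (drop the header line, then split and convert each remaining line) can be tried out on plain Python strings before running them on the cluster. The sketch below is a hedged illustration: `parse_line` is a hypothetical helper that mirrors the lambdas, and the e-mail values are made-up placeholders, since the real ones are truncated in the output.

```python
header = "age,gender,name,course,roll,marks,email"

# A few sample lines in the same layout as StudentData.csv.
# The e-mail values are placeholders, not data from the file.
lines = [
    header,
    "28,Female,Hubert Oliveras,DB,02984,59,a@example.com",
    "29,Female,Toshiko Hillyard,Cloud,12899,62,b@example.com",
]


def parse_line(line):
    """Split a CSV line and convert age and marks to int; roll stays a string."""
    fields = line.split(",")
    return [
        int(fields[0]), fields[1], fields[2], fields[3],
        fields[4], int(fields[5]), fields[6],
    ]


# Same logic as rdd.filter(...).map(...), applied to a local list.
rows = [parse_line(line) for line in lines if line != header]
columns = header.split(",")
print(columns)
print(rows)
```

Note that a plain `split(',')` assumes no field contains a comma inside quotes; a file with quoted commas would need a real CSV parser.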

In [53]:
columns = headers.split(",")
dfRdd = rdd.toDF(columns)
dfRdd.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB| 02984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

In [54]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
                     StructField("age", IntegerType(), True),
                     StructField("gender", StringType(), True),
                     StructField("name", StringType(), True),
                     StructField("course", StringType(), True),
                     StructField("roll", StringType(), True),
                     StructField("marks", IntegerType(), True),
                     StructField("email", StringType(), True)
])

In [55]:
dfRdd2 = spark.createDataFrame(rdd, schema=schema)
dfRdd2.show()
dfRdd2.printSchema()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB| 02984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

## Selecting DataFrame Columns

In [56]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()


In [58]:
df = spark.read.options(header = 'True', inferSchema = 'True').csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

* How to select columns in the DF

In [61]:
df.select("name","gender").show()

+----------------+------+
|            name|gender|
+----------------+------+
| Hubert Oliveras|Female|
|Toshiko Hillyard|Female|
|  Celeste Lollis|  Male|
|    Elenore Choy|Female|
|  Sheryll Towler|  Male|
|  Margene Moores|  Male|
|     Neda Briski|  Male|
|    Claude Panos|Female|
|  Celeste Lollis|  Male|
|  Cordie Harnois|  Male|
|       Kena Wild|Female|
| Ernest Rossbach|  Male|
|  Latia Vanhoose|Female|
|  Latia Vanhoose|Female|
|     Neda Briski|  Male|
|  Latia Vanhoose|Female|
|  Loris Crossett|  Male|
|  Annika Hoffman|  Male|
|   Santa Kerfien|  Male|
|Mickey Cortright|Female|
+----------------+------+
only showing top 20 rows



In [62]:
df.select(df.name, df.email).show()

+----------------+--------------------+
|            name|               email|
+----------------+--------------------+
| Hubert Oliveras|Annika Hoffman_Na...|
|Toshiko Hillyard|Margene Moores_Ma...|
|  Celeste Lollis|Jeannetta Golden_...|
|    Elenore Choy|Billi Clore_Mitzi...|
|  Sheryll Towler|Claude Panos_Judi...|
|  Margene Moores|Toshiko Hillyard_...|
|     Neda Briski|Alberta Freund_El...|
|    Claude Panos|Sheryll Towler_Al...|
|  Celeste Lollis|Nicole Harwood_Cl...|
|  Cordie Harnois|Judie Chipps_Clem...|
|       Kena Wild|Dustin Feagins_Ma...|
| Ernest Rossbach|Maybell Duguay_Ab...|
|  Latia Vanhoose|Latia Vanhoose_Mi...|
|  Latia Vanhoose|Eda Neathery_Nico...|
|     Neda Briski|Margene Moores_Mi...|
|  Latia Vanhoose|Claude Panos_Sant...|
|  Loris Crossett|Mitzi Seldon_Jenn...|
|  Annika Hoffman|Taryn Brownlee_Mi...|
|   Santa Kerfien|Judie Chipps_Tary...|
|Mickey Cortright|Ernest Rossbach_M...|
+----------------+--------------------+
only showing top 20 rows



In [63]:
from pyspark.sql.functions import col

df.select(col("roll"), col("name")).show()

+------+----------------+
|  roll|            name|
+------+----------------+
|  2984| Hubert Oliveras|
| 12899|Toshiko Hillyard|
| 21267|  Celeste Lollis|
| 32877|    Elenore Choy|
| 41487|  Sheryll Towler|
| 52771|  Margene Moores|
| 61973|     Neda Briski|
| 72409|    Claude Panos|
| 81492|  Celeste Lollis|
| 92882|  Cordie Harnois|
|102285|       Kena Wild|
|111449| Ernest Rossbach|
|122502|  Latia Vanhoose|
|132110|  Latia Vanhoose|
|141770|     Neda Briski|
|152159|  Latia Vanhoose|
|161771|  Loris Crossett|
|171660|  Annika Hoffman|
|182129|   Santa Kerfien|
|192537|Mickey Cortright|
+------+----------------+
only showing top 20 rows



In [64]:
df.select('*').show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

In [68]:
df.columns
df.select('age', 'gender', 'email').show()

+---+------+--------------------+
|age|gender|               email|
+---+------+--------------------+
| 28|Female|Annika Hoffman_Na...|
| 29|Female|Margene Moores_Ma...|
| 28|  Male|Jeannetta Golden_...|
| 29|Female|Billi Clore_Mitzi...|
| 28|  Male|Claude Panos_Judi...|
| 28|  Male|Toshiko Hillyard_...|
| 28|  Male|Alberta Freund_El...|
| 28|Female|Sheryll Towler_Al...|
| 28|  Male|Nicole Harwood_Cl...|
| 29|  Male|Judie Chipps_Clem...|
| 29|Female|Dustin Feagins_Ma...|
| 29|  Male|Maybell Duguay_Ab...|
| 28|Female|Latia Vanhoose_Mi...|
| 29|Female|Eda Neathery_Nico...|
| 29|  Male|Margene Moores_Mi...|
| 29|Female|Claude Panos_Sant...|
| 29|  Male|Mitzi Seldon_Jenn...|
| 29|  Male|Taryn Brownlee_Mi...|
| 29|  Male|Judie Chipps_Tary...|
| 28|Female|Ernest Rossbach_M...|
+---+------+--------------------+
only showing top 20 rows



In [73]:
df.select(df.columns[2:6]).show()

+----------------+------+------+-----+
|            name|course|  roll|marks|
+----------------+------+------+-----+
| Hubert Oliveras|    DB|  2984|   59|
|Toshiko Hillyard| Cloud| 12899|   62|
|  Celeste Lollis|    PF| 21267|   45|
|    Elenore Choy|    DB| 32877|   29|
|  Sheryll Towler|   DSA| 41487|   41|
|  Margene Moores|   MVC| 52771|   32|
|     Neda Briski|   OOP| 61973|   69|
|    Claude Panos| Cloud| 72409|   85|
|  Celeste Lollis|   MVC| 81492|   64|
|  Cordie Harnois|   OOP| 92882|   51|
|       Kena Wild|   DSA|102285|   35|
| Ernest Rossbach|    DB|111449|   53|
|  Latia Vanhoose|    DB|122502|   27|
|  Latia Vanhoose|   MVC|132110|   55|
|     Neda Briski|    PF|141770|   42|
|  Latia Vanhoose|    DB|152159|   27|
|  Loris Crossett|   MVC|161771|   36|
|  Annika Hoffman|   OOP|171660|   22|
|   Santa Kerfien|    PF|182129|   56|
|Mickey Cortright|    DB|192537|   62|
+----------------+------+------+-----+
only showing top 20 rows
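
Since df.columns is an ordinary Python list of column names, the slice passed to select follows normal list slicing rules (start index included, end index excluded). A minimal sketch, assuming the seven column names shown in the schema above:

```python
# Column names in the order they appear in the DataFrame.
columns = ["age", "gender", "name", "course", "roll", "marks", "email"]

# Indexes 2, 3, 4 and 5: the end index 6 is excluded.
middle = columns[2:6]

# Slices can be combined to pick non-adjacent columns.
first_two_and_last = columns[:2] + columns[-1:]

print(middle)
print(first_two_and_last)
```

Either list could then be passed to select in the same way as df.columns[2:6].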



In [77]:
df2 = df.select(col("roll"), col("name"))

In [78]:
df2.show()

+------+----------------+
|  roll|            name|
+------+----------------+
|  2984| Hubert Oliveras|
| 12899|Toshiko Hillyard|
| 21267|  Celeste Lollis|
| 32877|    Elenore Choy|
| 41487|  Sheryll Towler|
| 52771|  Margene Moores|
| 61973|     Neda Briski|
| 72409|    Claude Panos|
| 81492|  Celeste Lollis|
| 92882|  Cordie Harnois|
|102285|       Kena Wild|
|111449| Ernest Rossbach|
|122502|  Latia Vanhoose|
|132110|  Latia Vanhoose|
|141770|     Neda Briski|
|152159|  Latia Vanhoose|
|161771|  Loris Crossett|
|171660|  Annika Hoffman|
|182129|   Santa Kerfien|
|192537|Mickey Cortright|
+------+----------------+
only showing top 20 rows



## withColumn on a DataFrame

In [79]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()
df = spark.read.options(header = 'True', inferSchema = 'True').csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')

In [80]:
df = spark.read.options(header = 'True', inferSchema = 'True').csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

In [101]:
from pyspark.sql.functions import col, lit
df = df.withColumn("roll", col("roll").cast("string"))

In [94]:
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

In [95]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- roll: string (nullable = true)
 |-- marks: integer (nullable = true)
 |-- email: string (nullable = true)



In [98]:
# Adding 10 to the marks column
df = df.withColumn("marks", col('marks') + 10)
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   69|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   72|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   55|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   39|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   51|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   42|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   79|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   95|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   74|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   61|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   45|Dustin Feagins_Ma...|
| 29| 

In [108]:
# Creating a new column
df = df.withColumn("aggregated marks", col('marks') - 10)
df.show()

+---+------+----------------+------+------+-----+--------------------+----------------+-------+
|age|gender|            name|course|  roll|marks|               email|aggregated marks|Country|
+---+------+----------------+------+------+-----+--------------------+----------------+-------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|              49|    USA|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|              52|    USA|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|              35|    USA|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|              19|    USA|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|              31|    USA|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|              22|    USA|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|              59|    USA|
| 28|Female|    Claude Panos| Cloud| 724

In [104]:
from pyspark.sql.functions import col, lit
df = df.withColumn("Country", lit("USA"))
df.show()

+---+------+----------------+------+------+-----+--------------------+----------------+-------+
|age|gender|            name|course|  roll|marks|               email|aggregated marks|Country|
+---+------+----------------+------+------+-----+--------------------+----------------+-------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|              49|    USA|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|              52|    USA|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|              35|    USA|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|              19|    USA|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|              31|    USA|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|              22|    USA|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|              59|    USA|
| 28|Female|    Claude Panos| Cloud| 724

In [110]:
df.withColumn("marks", col("marks")- 10).withColumn("updated marks", col("marks") + 20).withColumn("Country", lit("USA")).show()

+---+------+----------------+------+------+-----+--------------------+----------------+-------+-------------+
|age|gender|            name|course|  roll|marks|               email|aggregated marks|Country|updated marks|
+---+------+----------------+------+------+-----+--------------------+----------------+-------+-------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   49|Annika Hoffman_Na...|              49|    USA|           69|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   52|Margene Moores_Ma...|              52|    USA|           72|
| 28|  Male|  Celeste Lollis|    PF| 21267|   35|Jeannetta Golden_...|              35|    USA|           55|
| 29|Female|    Elenore Choy|    DB| 32877|   19|Billi Clore_Mitzi...|              19|    USA|           39|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   31|Claude Panos_Judi...|              31|    USA|           51|
| 28|  Male|  Margene Moores|   MVC| 52771|   22|Toshiko Hillyard_...|              22|    USA|           42|
| 28|  Mal

### Spark DF: renaming columns with withColumnRenamed and alias

In [112]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()
df = spark.read.options(header = 'True', inferSchema = 'True').csv('/content/drive/MyDrive/PySpark/data/StudentData.csv')
df.show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29| 

In [115]:
# renaming columns
df = df.withColumnRenamed("gender","sex").withColumnRenamed("roll", "Roll number")
df.show()

+---+------+----------------+------+-----------+-----+--------------------+
|age|   sex|            name|course|Roll number|marks|               email|
+---+------+----------------+------+-----------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|       2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud|      12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF|      21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB|      32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA|      41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC|      52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP|      61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud|      72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC|      81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP|      92882|   51|Judie Chipps_Clem...|
| 29|Female|

* using alias

In [118]:
df.select(col("name").alias("Full name")).show()

+----------------+
|       Full name|
+----------------+
| Hubert Oliveras|
|Toshiko Hillyard|
|  Celeste Lollis|
|    Elenore Choy|
|  Sheryll Towler|
|  Margene Moores|
|     Neda Briski|
|    Claude Panos|
|  Celeste Lollis|
|  Cordie Harnois|
|       Kena Wild|
| Ernest Rossbach|
|  Latia Vanhoose|
|  Latia Vanhoose|
|     Neda Briski|
|  Latia Vanhoose|
|  Loris Crossett|
|  Annika Hoffman|
|   Santa Kerfien|
|Mickey Cortright|
+----------------+
only showing top 20 rows

