
In [None]:
%pip install pyspark

In [3]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark_Sample").master("local").getOrCreate()

In [7]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType


In [9]:
schema=StructType([StructField("id",IntegerType(),True),
                   StructField("gender",StringType(),True),
                   StructField("Name",StringType(),True)])

In [41]:
df=spark.read.option("header",True).schema(schema).csv("/content/MOCK_DATA.csv")
df.show()

+---+----------+--------------------+
| id|    gender|                Name|
+---+----------+--------------------+
|  1|      Male|     Brion Steckings|
|  2|      Male|      Pembroke Cassy|
|  3|    Female|       Wylma Ornells|
|  4|      Male|   Oberon Plackstone|
|  5|      Male|      Jeno Yakuntsov|
|  6|    Female|          Melly Gyde|
|  7|      Male|    Brant Avrahamoff|
|  8|    Female|         Kathie Codd|
|  9|  Bigender|        Pen Armstead|
| 10|      Male|           Alf Krahl|
| 11|Polygender|  Christen Waterhous|
| 12|    Female|     Elmira Sheering|
| 13|      Male|  Zacharia Mulcaster|
| 14|    Female|      Robinet Sanson|
| 15|      Male|     Elijah Bachmann|
| 16|  Bigender|Bartholomew Rubin...|
| 17|    Female|    Verine Vondrasek|
| 18|    Female|    Susanna Kennally|
| 19|      Male|    Lazarus Bondesen|
| 20|    Female|    Christel Vallens|
+---+----------+--------------------+
only showing top 20 rows



In [42]:
df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- Name: string (nullable = true)



In [43]:
from pyspark.sql import functions as F
df.select(F.max(F.col("id"))).show()

+-------+
|max(id)|
+-------+
|    100|
+-------+



In [14]:
Employee_schema=StructType([StructField("id",IntegerType(),True),
                            StructField("firstname",StringType(),True),
                            StructField("lastname",StringType(),True),
                            StructField("email",StringType(),True)])

When supplying an explicit schema, pass it through the reader's schema() method, <b>not</b> through option("schema", schema).

In [45]:
df2=spark.read.option("header",True).schema(Employee_schema).csv("/content/Employee.csv")
df2.show()

+---+---------+-----------+--------------------+
| id|firstname|   lastname|               email|
+---+---------+-----------+--------------------+
|  1|Cassandre|    Larkcum|   clarkcum0@tiny.cc|
|  2|    Merna|     Philip|mphilip1@histats.com|
|  3| Carolann|      Small|csmall2@cargocoll...|
|  4|  Valaree|   Reidshaw|vreidshaw3@china....|
|  5|     Susy|    Keesman|skeesman4@altervi...|
|  6|   Nealon|      Lydon|     nlydon5@ihg.com|
|  7|  Ceciley|  Hillborne|chillborne6@fastc...|
|  8| Jefferey|Thorneywork|jthorneywork7@pag...|
|  9|  Myrtice|    Rossant|mrossant8@edublog...|
| 10| Marieann|    Cettell|  mcettell9@ox.ac.uk|
| 11|     Mace|   Aronsohn|  maronsohna@msn.com|
| 12| Thorsten|   McKeever|tmckeeverb@market...|
| 13|   Melesa|Brotheridge|mbrotheridgec@ama...|
| 14|   Carney|  Pottiphar|cpottiphard@click...|
| 15|Rafaelita|     Keasey|  rkeaseye@google.nl|
| 16|    Kelci|    Kyngdon|kkyngdonf@cdbaby.com|
| 17|  Harriot|   McNickle|hmcnickleg@chicag...|
| 18|    Andie|     

Joining DataFrames in PySpark

In [47]:
df3=df.join(df2,df.id==df2.id,"inner") \
.select(df.id,df.gender,df2.firstname,df2.lastname,df2.email)
df3.show(5)

+---+------+---------+--------+--------------------+
| id|gender|firstname|lastname|               email|
+---+------+---------+--------+--------------------+
|  1|  Male|Cassandre| Larkcum|   clarkcum0@tiny.cc|
|  2|  Male|    Merna|  Philip|mphilip1@histats.com|
|  3|Female| Carolann|   Small|csmall2@cargocoll...|
|  4|  Male|  Valaree|Reidshaw|vreidshaw3@china....|
|  5|  Male|     Susy| Keesman|skeesman4@altervi...|
+---+------+---------+--------+--------------------+
only showing top 5 rows



Counting the males in the joined dataset

In [48]:
df3.filter("gender='Male'").groupBy().count().show()

+-----+
|count|
+-----+
|   45|
+-----+



Counting employees by gender

In [49]:
df3.groupBy("gender").count().show()

+----------+-----+
|    gender|count|
+----------+-----+
|   Agender|    1|
|    Female|   48|
|Polygender|    2|
|  Bigender|    3|
|Non-binary|    1|
|      Male|   45|
+----------+-----+



In [50]:
df3.orderBy(df3.id.asc()).show(20)

+---+----------+---------+-----------+--------------------+
| id|    gender|firstname|   lastname|               email|
+---+----------+---------+-----------+--------------------+
|  1|      Male|Cassandre|    Larkcum|   clarkcum0@tiny.cc|
|  2|      Male|    Merna|     Philip|mphilip1@histats.com|
|  3|    Female| Carolann|      Small|csmall2@cargocoll...|
|  4|      Male|  Valaree|   Reidshaw|vreidshaw3@china....|
|  5|      Male|     Susy|    Keesman|skeesman4@altervi...|
|  6|    Female|   Nealon|      Lydon|     nlydon5@ihg.com|
|  7|      Male|  Ceciley|  Hillborne|chillborne6@fastc...|
|  8|    Female| Jefferey|Thorneywork|jthorneywork7@pag...|
|  9|  Bigender|  Myrtice|    Rossant|mrossant8@edublog...|
| 10|      Male| Marieann|    Cettell|  mcettell9@ox.ac.uk|
| 11|Polygender|     Mace|   Aronsohn|  maronsohna@msn.com|
| 12|    Female| Thorsten|   McKeever|tmckeeverb@market...|
| 13|      Male|   Melesa|Brotheridge|mbrotheridgec@ama...|
| 14|    Female|   Carney|  Pottiphar|cp