In [None]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [None]:
!wget -q https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz

In [None]:
!tar xf spark-3.2.0-bin-hadoop3.2.tgz


In [None]:
!pip install -q findspark


In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.0-bin-hadoop3.2"

In [None]:
import findspark
findspark.init()


In [None]:
findspark.find()

'/content/spark-3.2.0-bin-hadoop3.2'

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()


To create an RDD from a list, use the `parallelize()` method.

In [None]:
dataList = [("Java", 20000),("Python", 10000),("Scala", 2000)]
rdd=spark.sparkContext.parallelize(dataList)

In [None]:
rdd.collect()

[('Java', 20000), ('Python', 10000), ('Scala', 2000)]

To create an RDD from an external source such as a text file, use the `textFile()` method. Each line of the file becomes one element of the RDD.

In [None]:
rdd2=spark.sparkContext.textFile("/content/input.txt")

In [None]:
rdd2.collect()

['In business and in life, the most critical choices we make relate to people. Yet being a good judge of people is difficult. How do we get better at sizing up first impressions, at avoiding hiring mistakes, at correctly picking (and not missing) rising stars?',
 '',
 'The easy thing to do is focus on extrinsic markers — academic scores, net worth, social status, job titles. Social media has allowed us to add new layers of extrinsic scoring: How many friends do they have on Facebook? Who do we know in common through LinkedIn? How many Twitter followers do they have?',
 '',
 'But such extrinsic credentials and markers only tell one part of a person’s story. They are necessary, but not sufficient. What they miss are the “softer” and more nuanced intrinsic that are far more defining of a person’s character. You can teach skills; character and attitude, not so much.',
 '',
 'Judging on extrinsic and skill-based factors is a relatively objective and straightforward exercise. Gauging softer 

# **DataFrame**
A DataFrame can be created in two main ways:


1.   from a Python `list`, using the `createDataFrame()` method (the simplest way).
2.   from external data sources:
        * Files from local system
        * HDFS
        * SQL table

  To read a CSV file from the local system, use

      `df = spark.read.csv("/path/file.csv")`

      `df.printSchema()`

In [None]:
data = [('Mathew','Sebastian','M','50000'),('Smith','Jones','M','60000'),('Franky','','M','50000'),('Julie','steves','F','70000'),('Ancy','Mathew','F','60000')]
columns = ["FirstName","LastName","Gender","Salary"]
df = spark.createDataFrame(data=data, schema = columns)

To print the schema of a DataFrame, use `df.printSchema()`.
(A DataFrame is a structured format: every column has a name and a data type.)

In [None]:
df.printSchema()

root
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Salary: string (nullable = true)



To see the elements of the created data frame, use `df.show()`.

In [None]:
df.show()

+---------+---------+------+------+
|FirstName| LastName|Gender|Salary|
+---------+---------+------+------+
|   Mathew|Sebastian|     M| 50000|
|    Smith|    Jones|     M| 60000|
|   Franky|         |     M| 50000|
|    Julie|   steves|     F| 70000|
|     Ancy|   Mathew|     F| 60000|
+---------+---------+------+------+



The `show()` method takes optional arguments that control how many rows are printed and whether long column values are truncated:


1.   `df.show(4)` shows only the top 4 rows.
2.   `df.show(truncate=False)` shows the rows (up to the default of 20) with full column values.
3.   `df.show(4,truncate=False)` shows only the top 4 rows with full column values.
4.   `df.show(3,truncate=3,vertical=True)` shows the top 3 rows vertically, one field per line, with each value truncated to 3 characters.
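The effect of `truncate` on a single cell value can be modelled in plain Python. This is a minimal sketch and not Spark's own code: the helper name `truncate_cell` is hypothetical, and the rule it encodes (a default width of 20, with `...` replacing the last three characters when the width is 4 or more) is an assumption based on the documented default and on the outputs printed in this notebook.

```python
def truncate_cell(value, truncate=True):
    """Model how show() shortens one cell value (hypothetical helper, not a Spark API)."""
    # truncate=True means a width of 20; truncate=False means no limit.
    if truncate is True:
        width = 20
    elif truncate is False:
        width = 0
    else:
        width = int(truncate)

    text = str(value)
    if width <= 0 or len(text) <= width:
        return text
    if width < 4:
        # Too narrow for an ellipsis: keep only the first `width` characters.
        return text[:width]
    # Otherwise keep width - 3 characters and mark the cut with "...".
    return text[:width - 3] + "..."


print(truncate_cell("Sebastian", truncate=3))                  # Seb
print(truncate_cell("[Scala, Java, Python]"))                  # [Scala, Java, Pyt...
print(truncate_cell("[Scala, Java, Python]", truncate=False))  # [Scala, Java, Python]
```

The first call mirrors what `df.show(3,truncate=3,vertical=True)` does to the value `Sebastian`.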



In [None]:
df.show(4)

+---------+---------+------+------+
|FirstName| LastName|Gender|Salary|
+---------+---------+------+------+
|   Mathew|Sebastian|     M| 50000|
|    Smith|    Jones|     M| 60000|
|   Franky|         |     M| 50000|
|    Julie|   steves|     F| 70000|
+---------+---------+------+------+
only showing top 4 rows



In [None]:
df.show(truncate=False)

+---------+---------+------+------+
|FirstName|LastName |Gender|Salary|
+---------+---------+------+------+
|Mathew   |Sebastian|M     |50000 |
|Smith    |Jones    |M     |60000 |
|Franky   |         |M     |50000 |
|Julie    |steves   |F     |70000 |
|Ancy     |Mathew   |F     |60000 |
+---------+---------+------+------+



In [None]:
df.show(4,truncate=False)

+---------+---------+------+------+
|FirstName|LastName |Gender|Salary|
+---------+---------+------+------+
|Mathew   |Sebastian|M     |50000 |
|Smith    |Jones    |M     |60000 |
|Franky   |         |M     |50000 |
|Julie    |steves   |F     |70000 |
+---------+---------+------+------+
only showing top 4 rows



In [None]:
df.show(3,truncate=3,vertical=True)

-RECORD 0--------
 FirstName | Mat 
 LastName  | Seb 
 Gender    | M   
 Salary    | 500 
-RECORD 1--------
 FirstName | Smi 
 LastName  | Jon 
 Gender    | M   
 Salary    | 600 
-RECORD 2--------
 FirstName | Fra 
 LastName  |     
 Gender    | M   
 Salary    | 500 
only showing top 3 rows



# Using `withColumn()` function in PySpark
`withColumn()` is a DataFrame transformation that is used to change the value of an existing column, convert the data type of an existing column, create a new column, and so on.





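1.   Changing the data type of an existing column: to change the data type of an existing column, use the `cast()` function along with `withColumn()`. Note that `withColumn()` returns a new DataFrame and leaves `df` unchanged, so the result has to be assigned to a variable to be kept.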




In [None]:
from pyspark.sql.functions import col, lit
df.withColumn("Salary",col("Salary").cast("Integer"))


DataFrame[FirstName: string, LastName: string, Gender: string, Salary: int]

In [None]:
df.show()

+---------+---------+------+------+
|FirstName| LastName|Gender|Salary|
+---------+---------+------+------+
|   Mathew|Sebastian|     M| 50000|
|    Smith|    Jones|     M| 60000|
|   Franky|         |     M| 50000|
|    Julie|   steves|     F| 70000|
|     Ancy|   Mathew|     F| 60000|
+---------+---------+------+------+



2. Updating the value of an existing column

In [None]:
df.withColumn("Salary",col("Salary")*2).show()

+---------+---------+------+--------+
|FirstName| LastName|Gender|  Salary|
+---------+---------+------+--------+
|   Mathew|Sebastian|     M|100000.0|
|    Smith|    Jones|     M|120000.0|
|   Franky|         |     M|100000.0|
|    Julie|   steves|     F|140000.0|
|     Ancy|   Mathew|     F|120000.0|
+---------+---------+------+--------+
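The `.0` in the output above appears because `Salary` is still a string column at this point (the cast in step 1 was not assigned back to `df`), so the strings are converted to floating-point numbers before the multiplication. The plain-Python analogy below needs no Spark; the variable names are illustrative only.

```python
# Plain-Python analogy; these names are illustrative and not part of the notebook.
salary_as_text = "50000"

# Roughly what happens to a string column: parse to a float, then multiply.
doubled_from_text = float(salary_as_text) * 2

# What happens after an explicit integer cast: the result stays an integer.
doubled_from_int = int(salary_as_text) * 2

print(doubled_from_text)  # 100000.0
print(doubled_from_int)   # 100000
```

In the notebook, assigning the cast result first, for example `df = df.withColumn("Salary", col("Salary").cast("Integer"))`, would keep the doubled values as integers.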



3. Updating a column based on a condition: to update a column value based on a condition, use `when()` and `otherwise()`.

In [None]:
from pyspark.sql.functions import when
df1=df.withColumn("Gender",when(df.Gender == "M", "Male")\
              .when(df.Gender == "F", "Female")\
              .otherwise(df.Gender))
df1.show()

              

+---------+---------+------+------+
|FirstName| LastName|Gender|Salary|
+---------+---------+------+------+
|   Mathew|Sebastian|  Male| 50000|
|    Smith|    Jones|  Male| 60000|
|   Franky|         |  Male| 50000|
|    Julie|   steves|Female| 70000|
|     Ancy|   Mathew|Female| 60000|
+---------+---------+------+------+
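The chained `when()` calls are checked in order and the first match wins, with `otherwise()` acting as the fallback. A plain-Python model of that logic for a single value might look like this; the function name `relabel_gender` and the extra code `"X"` are hypothetical and only serve the illustration.

```python
def relabel_gender(code):
    """Plain-Python model of when(...).when(...).otherwise(...) for one value."""
    if code == "M":       # first when(): checked first
        return "Male"
    elif code == "F":     # second when(): checked only if the first did not match
        return "Female"
    else:                 # otherwise(): fallback, here the original value
        return code


codes = ["M", "M", "M", "F", "F", "X"]
print([relabel_gender(c) for c in codes])
# ['Male', 'Male', 'Male', 'Female', 'Female', 'X']
```

If `otherwise()` is left out in PySpark, rows that match no `when()` branch get a null value instead of keeping the original.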



4. Adding a new column to the data frame: to create a new column, pass the new column name as the first argument of `withColumn()` and a column expression as the second. `lit()` wraps a constant value as a column.

In [None]:
df1.withColumn("Country", lit("USA")).show()

+---------+---------+------+------+-------+
|FirstName| LastName|Gender|Salary|Country|
+---------+---------+------+------+-------+
|   Mathew|Sebastian|  Male| 50000|    USA|
|    Smith|    Jones|  Male| 60000|    USA|
|   Franky|         |  Male| 50000|    USA|
|    Julie|   steves|Female| 70000|    USA|
|     Ancy|   Mathew|Female| 60000|    USA|
+---------+---------+------+------+-------+



In [None]:
df1.drop("Country").show()

+---------+---------+------+------+
|FirstName| LastName|Gender|Salary|
+---------+---------+------+------+
|   Mathew|Sebastian|  Male| 50000|
|    Smith|    Jones|  Male| 60000|
|   Franky|         |  Male| 50000|
|    Julie|   steves|Female| 70000|
|     Ancy|   Mathew|Female| 60000|
+---------+---------+------+------+



# To filter rows based on a given condition
The PySpark `filter()` function filters the rows of an RDD/DataFrame based on a given condition or SQL expression. `where()` can be used in the same way.

In [None]:
df1.filter("gender == 'Male'").show()

+---------+---------+------+------+
|FirstName| LastName|Gender|Salary|
+---------+---------+------+------+
|   Mathew|Sebastian|  Male| 50000|
|    Smith|    Jones|  Male| 60000|
|   Franky|         |  Male| 50000|
+---------+---------+------+------+



In [None]:
df1.filter("Gender != 'Male'").show()

+---------+--------+------+------+
|FirstName|LastName|Gender|Salary|
+---------+--------+------+------+
|    Julie|  steves|Female| 70000|
|     Ancy|  Mathew|Female| 60000|
+---------+--------+------+------+



In [None]:
df1.filter("Gender <> 'Male'").show()

+---------+--------+------+------+
|FirstName|LastName|Gender|Salary|
+---------+--------+------+------+
|    Julie|  steves|Female| 70000|
|     Ancy|  Mathew|Female| 60000|
+---------+--------+------+------+



`filter()` with multiple conditions
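Combining conditions follows ordinary boolean logic. The sketch below models it in plain Python on rows that mirror `df1`, with `Salary` written as integers for the comparison (an assumption; in the notebook the column is still a string). The comments give the corresponding PySpark spelling, where conditions are joined with `&`, `|` and `~` and each condition needs its own parentheses.

```python
# Rows mirroring df1; Salary is written as int here for illustration only.
rows = [
    {"FirstName": "Mathew", "Gender": "Male", "Salary": 50000},
    {"FirstName": "Smith", "Gender": "Male", "Salary": 60000},
    {"FirstName": "Franky", "Gender": "Male", "Salary": 50000},
    {"FirstName": "Julie", "Gender": "Female", "Salary": 70000},
    {"FirstName": "Ancy", "Gender": "Female", "Salary": 60000},
]

# AND: both conditions must hold.
# PySpark spelling: df1.filter((df1.Gender == "Male") & (df1.Salary > 50000))
male_high = [r for r in rows if r["Gender"] == "Male" and r["Salary"] > 50000]

# OR: at least one condition must hold.
# PySpark spelling: df1.filter((df1.Gender == "Female") | (df1.Salary > 55000))
female_or_high = [r for r in rows if r["Gender"] == "Female" or r["Salary"] > 55000]

# NOT: negate a condition.
# PySpark spelling: df1.filter(~(df1.Gender == "Male"))
not_male = [r for r in rows if not r["Gender"] == "Male"]

print([r["FirstName"] for r in male_high])       # ['Smith']
print([r["FirstName"] for r in female_or_high])  # ['Smith', 'Julie', 'Ancy']
print([r["FirstName"] for r in not_male])        # ['Julie', 'Ancy']
```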

In [None]:
df1.filter(df1.Gender == "Male").show()

+---------+---------+------+------+
|FirstName| LastName|Gender|Salary|
+---------+---------+------+------+
|   Mathew|Sebastian|  Male| 50000|
|    Smith|    Jones|  Male| 60000|
|   Franky|         |  Male| 50000|
+---------+---------+------+------+



In [None]:
df1.withColumn("Department", lit("Finance")) \
.show()

+---------+---------+------+------+----------+
|FirstName| LastName|Gender|Salary|Department|
+---------+---------+------+------+----------+
|   Mathew|Sebastian|  Male| 50000|   Finance|
|    Smith|    Jones|  Male| 60000|   Finance|
|   Franky|         |  Male| 50000|   Finance|
|    Julie|   steves|Female| 70000|   Finance|
|     Ancy|   Mathew|Female| 60000|   Finance|
+---------+---------+------+------+----------+



In [None]:
df2=df1.withColumn("Department", lit("Finance"))

In [None]:
df2.show()

+---------+---------+------+------+----------+
|FirstName| LastName|Gender|Salary|Department|
+---------+---------+------+------+----------+
|   Mathew|Sebastian|  Male| 50000|   Finance|
|    Smith|    Jones|  Male| 60000|   Finance|
|   Franky|         |  Male| 50000|   Finance|
|    Julie|   steves|Female| 70000|   Finance|
|     Ancy|   Mathew|Female| 60000|   Finance|
+---------+---------+------+------+----------+



# PySpark `orderBy()` and `sort()`

In [None]:

df2.sort("Gender","Department").show(truncate=False)
df2.sort(col("Gender"),col("Department")).show(truncate=False)


+---------+---------+------+------+----------+
|FirstName|LastName |Gender|Salary|Department|
+---------+---------+------+------+----------+
|Julie    |steves   |Female|70000 |Finance   |
|Ancy     |Mathew   |Female|60000 |Finance   |
|Mathew   |Sebastian|Male  |50000 |Finance   |
|Smith    |Jones    |Male  |60000 |Finance   |
|Franky   |         |Male  |50000 |Finance   |
+---------+---------+------+------+----------+

+---------+---------+------+------+----------+
|FirstName|LastName |Gender|Salary|Department|
+---------+---------+------+------+----------+
|Julie    |steves   |Female|70000 |Finance   |
|Ancy     |Mathew   |Female|60000 |Finance   |
|Mathew   |Sebastian|Male  |50000 |Finance   |
|Smith    |Jones    |Male  |60000 |Finance   |
|Franky   |         |Male  |50000 |Finance   |
+---------+---------+------+------+----------+



In [None]:

df2.sort(df2.Salary.asc(),df2.Department.asc()).show(truncate=False)


+---------+---------+------+------+----------+
|FirstName|LastName |Gender|Salary|Department|
+---------+---------+------+------+----------+
|Mathew   |Sebastian|Male  |50000 |Finance   |
|Franky   |         |Male  |50000 |Finance   |
|Smith    |Jones    |Male  |60000 |Finance   |
|Ancy     |Mathew   |Female|60000 |Finance   |
|Julie    |steves   |Female|70000 |Finance   |
+---------+---------+------+------+----------+



In [None]:
df2.sort(df2.Salary.desc(),df2.Department.desc()).show(truncate=False)

+---------+---------+------+------+----------+
|FirstName|LastName |Gender|Salary|Department|
+---------+---------+------+------+----------+
|Julie    |steves   |Female|70000 |Finance   |
|Ancy     |Mathew   |Female|60000 |Finance   |
|Smith    |Jones    |Male  |60000 |Finance   |
|Mathew   |Sebastian|Male  |50000 |Finance   |
|Franky   |         |Male  |50000 |Finance   |
+---------+---------+------+------+----------+
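One caveat about the sorts above: `Salary` is still a string column, so its values are compared as text. That happens to give the expected order here because every salary has five digits. The plain-Python sketch below shows how text order and numeric order can differ; the value `"7000"` is a made-up example and not part of the notebook's data.

```python
# Salary values stored as text; "7000" is a made-up value for illustration.
salaries_text = ["50000", "7000", "60000"]

# Text sort compares character by character, so "7000" lands after "60000".
print(sorted(salaries_text))           # ['50000', '60000', '7000']

# Numeric sort after converting to int gives the expected order.
print(sorted(salaries_text, key=int))  # ['7000', '50000', '60000']
```

In PySpark the corresponding step would be to cast the column before sorting, for example `df2.sort(col("Salary").cast("Integer"))`.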



# PySpark Explode arrays & map columns to rows


Before we start, let’s create a DataFrame with array and map fields

In [None]:
arrayData = [
             ('James',['Java','Python'],{'Hair':'Black','Eyes':'Brown'}),
             ('Rani',['Scala','Java','Python'],{'Hair':'Brown','Eyes':'Black'}),
             ('Mathew',['C','Java'],{'Hair':'Black','Eyes':'Blue'}),
             ('Ancy',['Python','C','Java'],{'Hair':'Brown','Eyes':'Black'})

]
df = spark.createDataFrame(data=arrayData, schema = ['Name','LanguageKnown','Properties'])
df.printSchema()
df.show()

root
 |-- Name: string (nullable = true)
 |-- LanguageKnown: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- Properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+------+--------------------+--------------------+
|  Name|       LanguageKnown|          Properties|
+------+--------------------+--------------------+
| James|      [Java, Python]|{Hair -> Black, E...|
|  Rani|[Scala, Java, Pyt...|{Hair -> Brown, E...|
|Mathew|           [C, Java]|{Hair -> Black, E...|
|  Ancy|   [Python, C, Java]|{Hair -> Brown, E...|
+------+--------------------+--------------------+



The PySpark function `explode()` turns array or map columns into rows. When an array is passed, it creates a new column, named `col` by default, that holds one array element per row. When a map is passed, it creates two new columns, `key` and `value`, and each entry of the map becomes its own row.
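What `explode()` does to each row can be modelled in plain Python without Spark. The two generator functions below are hypothetical helpers written for this illustration, and the sample rows are shortened. The model assumes that rows with a null or empty collection produce no output rows.

```python
def explode_array(rows):
    """Yield one (name, element) pair per array element."""
    for name, languages in rows:
        # A None or empty array contributes no output rows.
        for language in languages or []:
            yield (name, language)


def explode_map(rows):
    """Yield one (name, key, value) triple per map entry."""
    for name, properties in rows:
        # A None or empty map contributes no output rows.
        for key, value in (properties or {}).items():
            yield (name, key, value)


people = [("James", ["Java", "Python"]), ("Rani", []), ("Mathew", None)]
print(list(explode_array(people)))
# [('James', 'Java'), ('James', 'Python')]

traits = [("James", {"Hair": "Black", "Eyes": "Brown"})]
print(list(explode_map(traits)))
# [('James', 'Hair', 'Black'), ('James', 'Eyes', 'Brown')]
```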

**Explode - Array column**

In [None]:
from pyspark.sql.functions import explode
df1 = df.select(df.Name, explode(df.LanguageKnown))
df1.printSchema()
df1.show()

root
 |-- Name: string (nullable = true)
 |-- col: string (nullable = true)

+------+------+
|  Name|   col|
+------+------+
| James|  Java|
| James|Python|
|  Rani| Scala|
|  Rani|  Java|
|  Rani|Python|
|Mathew|     C|
|Mathew|  Java|
|  Ancy|Python|
|  Ancy|     C|
|  Ancy|  Java|
+------+------+



**Explode - Map column**

In [None]:
from pyspark.sql.functions import explode
df1 = df.select(df.Name, explode(df.Properties))
df1.printSchema()
df1.show()

root
 |-- Name: string (nullable = true)
 |-- key: string (nullable = false)
 |-- value: string (nullable = true)

+------+----+-----+
|  Name| key|value|
+------+----+-----+
| James|Hair|Black|
| James|Eyes|Brown|
|  Rani|Hair|Brown|
|  Rani|Eyes|Black|
|Mathew|Hair|Black|
|Mathew|Eyes| Blue|
|  Ancy|Hair|Brown|
|  Ancy|Eyes|Black|
+------+----+-----+



# PySpark Aggregate functions
PySpark SQL aggregate functions are grouped as `agg_funcs` in PySpark. They operate on a group of rows and return a single value for each group.

In [None]:

simpleData = [("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  ]
schema = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show()


root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
|        James|     Sales|  3000|
|        Scott|   Finance|  3300|
|          Jen|   Finance|  3900|
|         Jeff| Marketing|  3000|
|        Kumar| Marketing|  2000|
|         Saif|     Sales|  4100|
+-------------+----------+------+
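To make the idea of a grouped aggregate concrete, here is a plain-Python sketch that computes count, sum, average and maximum of `salary` per `department` on the same rows. It is a model of the computation and not Spark code. The comment shows one way the equivalent PySpark call could be written, assuming the alias `F` for `pyspark.sql.functions`.

```python
from collections import defaultdict

# Same rows as simpleData above.
rows = [
    ("James", "Sales", 3000), ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100), ("Maria", "Finance", 3000),
    ("James", "Sales", 3000), ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000), ("Saif", "Sales", 4100),
]

# Step 1: group the salaries by department.
salaries_by_department = defaultdict(list)
for _name, department, salary in rows:
    salaries_by_department[department].append(salary)

# Step 2: reduce each group to single values.
# PySpark equivalent (with `from pyspark.sql import functions as F`):
#   df.groupBy("department").agg(F.count("salary"), F.sum("salary"),
#                                F.avg("salary"), F.max("salary"))
summary = {
    department: {
        "count": len(values),
        "sum": sum(values),
        "avg": sum(values) / len(values),
        "max": max(values),
    }
    for department, values in salaries_by_department.items()
}

for department, stats in summary.items():
    print(department, stats)
# Sales {'count': 5, 'sum': 18800, 'avg': 3760.0, 'max': 4600}
# Finance {'count': 3, 'sum': 10200, 'avg': 3400.0, 'max': 3900}
# Marketing {'count': 2, 'sum': 5000, 'avg': 2500.0, 'max': 3000}
```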

