# PySpark Preprocessing (with COVID-19 Dataset)

In [1]:
import pandas as pd
import numpy as np
from datetime import date, timedelta, datetime
import time
import warnings
warnings.filterwarnings('ignore')

import pyspark 
from pyspark.sql import SparkSession, SQLContext
from pyspark.context import SparkContext
from pyspark.sql.functions import * 
from pyspark.sql.types import * 

In [2]:
# Initiate the Spark Session
app_name = "covid19_india"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .config("spark.ui.port","42229")\
        .getOrCreate()
sc = spark.sparkContext


In [3]:
spark

## Data
This dataset contains daily records for the states and union territories of India.

State-level data comes from the Ministry of Health & Family Welfare.

Acknowledgements

Thanks to the Indian Ministry of Health & Family Welfare for making the data available to the general public.

Thanks to covid19india.org for making the individual-level details, testing details, and vaccination details available to the general public.

The data is available on Kaggle: https://www.kaggle.com/datasets/sudalairajkumar/covid19-in-india

### 1. Basic Functions

#### [1] Load (Read) the data

In [6]:
# Change the path according to your google cloud bucket
cases = spark.read.load("gs://dataproc-staging-asia-east2-441991837520-q2hvd2t0/notebooks/jupyter/covid_19_india.csv",
                        format="csv", 
                        sep=",", 
                        inferSchema="true", 
                        header="true")

                                                                                

In [7]:
# First few rows in the file
cases.show()

+---+----------+-------+--------------------+-----------------------+------------------------+-----+------+---------+
|Sno|      Date|   Time|State/UnionTerritory|ConfirmedIndianNational|ConfirmedForeignNational|Cured|Deaths|Confirmed|
+---+----------+-------+--------------------+-----------------------+------------------------+-----+------+---------+
|  1|2020-01-30|6:00 PM|              Kerala|                      1|                       0|    0|     0|        1|
|  2|2020-01-31|6:00 PM|              Kerala|                      1|                       0|    0|     0|        1|
|  3|2020-02-01|6:00 PM|              Kerala|                      2|                       0|    0|     0|        2|
|  4|2020-02-02|6:00 PM|              Kerala|                      3|                       0|    0|     0|        3|
|  5|2020-02-03|6:00 PM|              Kerala|                      3|                       0|    0|     0|        3|
|  6|2020-02-04|6:00 PM|              Kerala|           

It looks fine right now, but as the number of columns increases, the output of show() becomes hard to read. I have found that the following trick helps by displaying the data in pandas format in my Jupyter Notebook.

The **.toPandas()** function converts a **Spark DataFrame** into a **pandas DataFrame**, which is much easier to play with. Because it collects the data onto the driver, limit the number of rows first.

In [8]:
cases.limit(10).toPandas()

Unnamed: 0,Sno,Date,Time,State/UnionTerritory,ConfirmedIndianNational,ConfirmedForeignNational,Cured,Deaths,Confirmed
0,1,2020-01-30,6:00 PM,Kerala,1,0,0,0,1
1,2,2020-01-31,6:00 PM,Kerala,1,0,0,0,1
2,3,2020-02-01,6:00 PM,Kerala,2,0,0,0,2
3,4,2020-02-02,6:00 PM,Kerala,3,0,0,0,3
4,5,2020-02-03,6:00 PM,Kerala,3,0,0,0,3
5,6,2020-02-04,6:00 PM,Kerala,3,0,0,0,3
6,7,2020-02-05,6:00 PM,Kerala,3,0,0,0,3
7,8,2020-02-06,6:00 PM,Kerala,3,0,0,0,3
8,9,2020-02-07,6:00 PM,Kerala,3,0,0,0,3
9,10,2020-02-08,6:00 PM,Kerala,3,0,0,0,3


#### [2] Change Column Names

To rename a single column:

In [9]:
cases = cases.withColumnRenamed("ConfirmedIndianNational","Confirmed_Indian_National")

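To rename all columns at once, pass the full list of names to toDF (the list below also restores the original name of the column renamed above):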

In [10]:
cases = cases.toDF(*['Sno', 'Date', 'Time', 'State/UnionTerritory', 'ConfirmedIndianNational', 'ConfirmedForeignNational',
       'Cured','Deaths', 'Confirmed'])

In [11]:
cases.show()

+---+----------+-------+--------------------+-----------------------+------------------------+-----+------+---------+
|Sno|      Date|   Time|State/UnionTerritory|ConfirmedIndianNational|ConfirmedForeignNational|Cured|Deaths|Confirmed|
+---+----------+-------+--------------------+-----------------------+------------------------+-----+------+---------+
|  1|2020-01-30|6:00 PM|              Kerala|                      1|                       0|    0|     0|        1|
|  2|2020-01-31|6:00 PM|              Kerala|                      1|                       0|    0|     0|        1|
|  3|2020-02-01|6:00 PM|              Kerala|                      2|                       0|    0|     0|        2|
|  4|2020-02-02|6:00 PM|              Kerala|                      3|                       0|    0|     0|        3|
|  5|2020-02-03|6:00 PM|              Kerala|                      3|                       0|    0|     0|        3|
|  6|2020-02-04|6:00 PM|              Kerala|           

#### [3] Select Columns

We can select a subset of columns using the **select** method.

In [12]:
cases = cases.select('Date','Time','State/UnionTerritory','Deaths')
cases.show()

+----------+-------+--------------------+------+
|      Date|   Time|State/UnionTerritory|Deaths|
+----------+-------+--------------------+------+
|2020-01-30|6:00 PM|              Kerala|     0|
|2020-01-31|6:00 PM|              Kerala|     0|
|2020-02-01|6:00 PM|              Kerala|     0|
|2020-02-02|6:00 PM|              Kerala|     0|
|2020-02-03|6:00 PM|              Kerala|     0|
|2020-02-04|6:00 PM|              Kerala|     0|
|2020-02-05|6:00 PM|              Kerala|     0|
|2020-02-06|6:00 PM|              Kerala|     0|
|2020-02-07|6:00 PM|              Kerala|     0|
|2020-02-08|6:00 PM|              Kerala|     0|
|2020-02-09|6:00 PM|              Kerala|     0|
|2020-02-10|6:00 PM|              Kerala|     0|
|2020-02-11|6:00 PM|              Kerala|     0|
|2020-02-12|6:00 PM|              Kerala|     0|
|2020-02-13|6:00 PM|              Kerala|     0|
|2020-02-14|6:00 PM|              Kerala|     0|
|2020-02-15|6:00 PM|              Kerala|     0|
|2020-02-16|6:00 PM|

#### [4] Sort by Column

In [13]:
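# Simple (ascending) sort. ConfirmedIndianNational is no longer among the selected columns,
# so the rows below are ordered by a column that is not displayed.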
cases.sort("ConfirmedIndianNational").show()

+----------+-------+--------------------+------+
|      Date|   Time|State/UnionTerritory|Deaths|
+----------+-------+--------------------+------+
|2020-03-29|7:30 PM|              Kerala|     1|
|2020-03-30|9:30 PM|               Delhi|     2|
|2020-03-29|7:30 PM|              Ladakh|     0|
|2020-03-29|7:30 PM|      Andhra Pradesh|     0|
|2020-03-29|7:30 PM|      Madhya Pradesh|     2|
|2020-03-29|7:30 PM|               Bihar|     1|
|2020-03-29|7:30 PM|         Maharashtra|     6|
|2020-03-29|7:30 PM|        Chhattisgarh|     0|
|2020-03-29|7:30 PM|             Manipur|     0|
|2020-03-29|7:30 PM|                 Goa|     0|
|2020-03-29|7:30 PM|             Mizoram|     0|
|2020-03-29|7:30 PM|             Haryana|     0|
|2020-03-29|7:30 PM|              Odisha|     0|
|2020-03-29|7:30 PM|   Jammu and Kashmir|     2|
|2020-03-29|7:30 PM|          Puducherry|     0|
|2020-03-29|7:30 PM|           Telengana|     1|
|2020-03-30|9:30 PM|        Chhattisgarh|     0|
|2020-03-29|7:30 PM|

In [14]:
# Descending Sort
from pyspark.sql import functions as F

cases.sort(F.desc("ConfirmedIndianNational")).show()

+----------+-------+--------------------+------+
|      Date|   Time|State/UnionTerritory|Deaths|
+----------+-------+--------------------+------+
|2020-03-25|6:00 PM|         West Bengal|     1|
|2020-03-18|6:00 PM|               Delhi|     1|
|2020-03-28|6:00 PM|               Bihar|     1|
|2020-03-24|6:00 PM|         West Bengal|     1|
|2020-03-09|6:00 PM|              Kerala|     0|
|2020-03-11|6:00 PM|       Uttar Pradesh|     0|
|2020-03-25|6:00 PM|      Andhra Pradesh|     0|
|2020-03-24|6:00 PM|              Kerala|     0|
|2020-03-24|6:00 PM|         Maharashtra|     2|
|2020-03-18|6:00 PM|              Ladakh|     0|
|2020-03-19|6:00 PM|              Ladakh|     0|
|2020-03-20|6:00 PM|           Telengana|     0|
|2020-03-28|6:00 PM|          Chandigarh|     0|
|2020-03-08|6:00 PM|              Kerala|     0|
|2020-03-24|6:00 PM|      Andhra Pradesh|     0|
|2020-03-17|6:00 PM|               Delhi|     1|
|2020-03-23|6:00 PM|         Maharashtra|     2|
|2020-03-08|6:00 PM|

#### [5] Change Column Type

In [15]:
cases.show()

+----------+-------+--------------------+------+
|      Date|   Time|State/UnionTerritory|Deaths|
+----------+-------+--------------------+------+
|2020-01-30|6:00 PM|              Kerala|     0|
|2020-01-31|6:00 PM|              Kerala|     0|
|2020-02-01|6:00 PM|              Kerala|     0|
|2020-02-02|6:00 PM|              Kerala|     0|
|2020-02-03|6:00 PM|              Kerala|     0|
|2020-02-04|6:00 PM|              Kerala|     0|
|2020-02-05|6:00 PM|              Kerala|     0|
|2020-02-06|6:00 PM|              Kerala|     0|
|2020-02-07|6:00 PM|              Kerala|     0|
|2020-02-08|6:00 PM|              Kerala|     0|
|2020-02-09|6:00 PM|              Kerala|     0|
|2020-02-10|6:00 PM|              Kerala|     0|
|2020-02-11|6:00 PM|              Kerala|     0|
|2020-02-12|6:00 PM|              Kerala|     0|
|2020-02-13|6:00 PM|              Kerala|     0|
|2020-02-14|6:00 PM|              Kerala|     0|
|2020-02-15|6:00 PM|              Kerala|     0|
|2020-02-16|6:00 PM|

In [16]:
from pyspark.sql.types import DoubleType, IntegerType, StringType

cases = cases.withColumn('Deaths', F.col('Deaths').cast(IntegerType()))
cases = cases.withColumn('State/UnionTerritory', F.col('State/UnionTerritory').cast(StringType()))

cases.show()

+----------+-------+--------------------+------+
|      Date|   Time|State/UnionTerritory|Deaths|
+----------+-------+--------------------+------+
|2020-01-30|6:00 PM|              Kerala|     0|
|2020-01-31|6:00 PM|              Kerala|     0|
|2020-02-01|6:00 PM|              Kerala|     0|
|2020-02-02|6:00 PM|              Kerala|     0|
|2020-02-03|6:00 PM|              Kerala|     0|
|2020-02-04|6:00 PM|              Kerala|     0|
|2020-02-05|6:00 PM|              Kerala|     0|
|2020-02-06|6:00 PM|              Kerala|     0|
|2020-02-07|6:00 PM|              Kerala|     0|
|2020-02-08|6:00 PM|              Kerala|     0|
|2020-02-09|6:00 PM|              Kerala|     0|
|2020-02-10|6:00 PM|              Kerala|     0|
|2020-02-11|6:00 PM|              Kerala|     0|
|2020-02-12|6:00 PM|              Kerala|     0|
|2020-02-13|6:00 PM|              Kerala|     0|
|2020-02-14|6:00 PM|              Kerala|     0|
|2020-02-15|6:00 PM|              Kerala|     0|
|2020-02-16|6:00 PM|

#### [6] Filter 

We can filter a DataFrame on multiple conditions using the AND (&), OR (|), and NOT (~) operators. For example, we may want all records for Delhi with more than 10 deaths.

In [17]:
cases.filter((cases.Deaths>10) & (cases["State/UnionTerritory"]=='Delhi')).show()

+----------+-------+--------------------+------+
|      Date|   Time|State/UnionTerritory|Deaths|
+----------+-------+--------------------+------+
|2020-04-10|5:00 PM|               Delhi|    13|
|2020-04-11|5:00 PM|               Delhi|    14|
|2020-04-12|5:00 PM|               Delhi|    19|
|2020-04-13|5:00 PM|               Delhi|    24|
|2020-04-14|5:00 PM|               Delhi|    28|
|2020-04-15|5:00 PM|               Delhi|    30|
|2020-04-16|5:00 PM|               Delhi|    32|
|2020-04-17|5:00 PM|               Delhi|    38|
|2020-04-18|5:00 PM|               Delhi|    42|
|2020-04-19|5:00 PM|               Delhi|    43|
|2020-04-20|5:00 PM|               Delhi|    45|
|2020-04-21|5:00 PM|               Delhi|    47|
|2020-04-22|5:00 PM|               Delhi|    47|
|2020-04-23|5:00 PM|               Delhi|    48|
|2020-04-24|5:00 PM|               Delhi|    50|
|2020-04-25|5:00 PM|               Delhi|    53|
|2020-04-26|5:00 PM|               Delhi|    54|
|2020-04-27|5:00 PM|

#### [7] GroupBy

In [18]:
from pyspark.sql import functions as F

cases.groupBy(["Date","State/UnionTerritory"]).agg(F.sum("Deaths") ,F.max("Deaths")).show()

+----------+--------------------+-----------+-----------+
|      Date|State/UnionTerritory|sum(Deaths)|max(Deaths)|
+----------+--------------------+-----------+-----------+
|2020-02-22|              Kerala|          0|          0|
|2020-03-25|             Manipur|          0|          0|
|2020-03-31|         Maharashtra|          9|          9|
|2020-04-06|             Gujarat|         12|         12|
|2020-04-13|               Delhi|         24|         24|
|2020-05-05|             Tripura|          0|          0|
|2020-05-27|              Odisha|          7|          7|
|2020-06-03|         West Bengal|        335|        335|
|2020-06-07|              Kerala|         15|         15|
|2020-06-14|          Tamil Nadu|        397|        397|
|2020-06-27|   Arunachal Pradesh|          1|          1|
|2020-06-29|          Puducherry|         10|         10|
|2020-06-29|Cases being reass...|          0|          0|
|2020-07-14|             Manipur|          0|          0|
|2020-07-16|An

If we don’t like the generated column names, we can use the **alias** method to rename the columns in the agg call itself.

In [22]:
cases.groupBy(["Date","State/UnionTerritory"]).agg(
    F.sum("Deaths").alias("Total_Deaths"),
    F.max("Deaths").alias("Max Deaths")
    ).show()

+----------+--------------------+------------+----------+
|      Date|State/UnionTerritory|Total_Deaths|Max Deaths|
+----------+--------------------+------------+----------+
|2020-02-22|              Kerala|           0|         0|
|2020-03-25|             Manipur|           0|         0|
|2020-03-31|         Maharashtra|           9|         9|
|2020-04-06|             Gujarat|          12|        12|
|2020-04-13|               Delhi|          24|        24|
|2020-05-05|             Tripura|           0|         0|
|2020-05-27|              Odisha|           7|         7|
|2020-06-03|         West Bengal|         335|       335|
|2020-06-07|              Kerala|          15|        15|
|2020-06-14|          Tamil Nadu|         397|       397|
|2020-06-27|   Arunachal Pradesh|           1|         1|
|2020-06-29|          Puducherry|          10|        10|
|2020-06-29|Cases being reass...|           0|         0|
|2020-07-14|             Manipur|           0|         0|
|2020-07-16|An

#### [8] Joins

Here, we will use the statewise testing file, which contains testing information such as TotalSamples, Negative, and Positive for each state and date.

In [23]:
# Change the path with your google bucket path
state_details = spark.read.load("gs://dataproc-staging-asia-east2-441991837520-q2hvd2t0/notebooks/jupyter/StatewiseTestingDetails.csv",
                          format="csv", 
                          sep=",", 
                          inferSchema="true", 
                          header="true")

state_details.limit(10).toPandas()

Unnamed: 0,Date,State,TotalSamples,Negative,Positive
0,2020-04-17,Andaman and Nicobar Islands,1403.0,1210.0,12.0
1,2020-04-24,Andaman and Nicobar Islands,2679.0,,27.0
2,2020-04-27,Andaman and Nicobar Islands,2848.0,,33.0
3,2020-05-01,Andaman and Nicobar Islands,3754.0,,33.0
4,2020-05-16,Andaman and Nicobar Islands,6677.0,,33.0
5,2020-05-19,Andaman and Nicobar Islands,6965.0,,33.0
6,2020-05-20,Andaman and Nicobar Islands,7082.0,,33.0
7,2020-05-21,Andaman and Nicobar Islands,7167.0,,33.0
8,2020-05-22,Andaman and Nicobar Islands,7263.0,,33.0
9,2020-05-23,Andaman and Nicobar Islands,7327.0,,33.0


In [24]:
state_details=state_details.withColumnRenamed("State","State/UnionTerritory")

In [25]:
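# Left join 'cases' with 'state_details' on the state column only.
# Date is not part of the join key, so each cases row is paired with every testing row for its state.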
cases = cases.join(state_details, ['State/UnionTerritory'],how='left')
cases.limit(10).toPandas()

Unnamed: 0,State/UnionTerritory,Date,Time,Deaths,Date.1,TotalSamples,Negative,Positive
0,Kerala,2020-01-30,6:00 PM,0,2021-08-10,28745545.0,,
1,Kerala,2020-01-30,6:00 PM,0,2021-08-09,28612776.0,,
2,Kerala,2020-01-30,6:00 PM,0,2021-08-08,28514136.0,,
3,Kerala,2020-01-30,6:00 PM,0,2021-08-07,28379940.0,,
4,Kerala,2020-01-30,6:00 PM,0,2021-08-06,28227419.0,,
5,Kerala,2020-01-30,6:00 PM,0,2021-08-05,28075527.0,,
6,Kerala,2020-01-30,6:00 PM,0,2021-08-04,27912151.0,,
7,Kerala,2020-01-30,6:00 PM,0,2021-08-03,27715059.0,,
8,Kerala,2020-01-30,6:00 PM,0,2021-08-02,27515603.0,,
9,Kerala,2020-01-30,6:00 PM,0,2021-08-01,27387700.0,,


### 2. Use SQL with DataFrames

We first register the cases DataFrame as a temporary view named cases_table, on which we can run SQL queries. As you can see, the result of the SQL select statement is again a Spark DataFrame.

Clauses such as GROUP BY, HAVING, and ORDER BY can all be used in the query passed to the sql function.

In [26]:
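# Drop the columns we no longer need (both Date columns left by the join are removed)
cases=cases.drop("Positive")
cases=cases.drop("Negative")
cases=cases.drop("Date")
# Register the DataFrame as a temporary view and query it through the session
cases.createOrReplaceTempView('cases_table')
newDF = spark.sql('select * from cases_table where Deaths > 50')
newDF.show()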

+--------------------+-------+------+------------+
|State/UnionTerritory|   Time|Deaths|TotalSamples|
+--------------------+-------+------+------------+
|         Maharashtra|5:00 PM|    64| 4.9905065E7|
|         Maharashtra|5:00 PM|    64| 4.9725694E7|
|         Maharashtra|5:00 PM|    64| 4.9568519E7|
|         Maharashtra|5:00 PM|    64| 4.9372212E7|
|         Maharashtra|5:00 PM|    64| 4.9172531E7|
|         Maharashtra|5:00 PM|    64| 4.8962106E7|
|         Maharashtra|5:00 PM|    64| 4.8744201E7|
|         Maharashtra|5:00 PM|    64| 4.8532523E7|
|         Maharashtra|5:00 PM|    64| 4.8352467E7|
|         Maharashtra|5:00 PM|    64|  4.818535E7|
|         Maharashtra|5:00 PM|    64| 4.7967609E7|
|         Maharashtra|5:00 PM|    64| 4.7760862E7|
|         Maharashtra|5:00 PM|    64| 4.7559938E7|
|         Maharashtra|5:00 PM|    64| 4.7369757E7|
|         Maharashtra|5:00 PM|    64| 4.7176715E7|
|         Maharashtra|5:00 PM|    64| 4.6995122E7|
|         Maharashtra|5:00 PM| 

### 3. Create New Columns

There are many ways that you can use to create a column in a PySpark Dataframe.

#### [1] Using Spark Native Functions

We can use .withColumn together with the PySpark SQL functions to create a new column. Spark already implements most string, date, and math functions. The first one, F.col, gives us access to a column. So if we wanted to add 10 to a column, we could use F.col as:

In [27]:
import pyspark.sql.functions as F

casesWithNewConfirmed = cases.withColumn("New_Deaths", 10 + F.col("Deaths"))
casesWithNewConfirmed.show()

+--------------------+-------+------+------------+----------+
|State/UnionTerritory|   Time|Deaths|TotalSamples|New_Deaths|
+--------------------+-------+------+------------+----------+
|              Kerala|6:00 PM|     0| 2.8745545E7|        10|
|              Kerala|6:00 PM|     0| 2.8612776E7|        10|
|              Kerala|6:00 PM|     0| 2.8514136E7|        10|
|              Kerala|6:00 PM|     0|  2.837994E7|        10|
|              Kerala|6:00 PM|     0| 2.8227419E7|        10|
|              Kerala|6:00 PM|     0| 2.8075527E7|        10|
|              Kerala|6:00 PM|     0| 2.7912151E7|        10|
|              Kerala|6:00 PM|     0| 2.7715059E7|        10|
|              Kerala|6:00 PM|     0| 2.7515603E7|        10|
|              Kerala|6:00 PM|     0|   2.73877E7|        10|
|              Kerala|6:00 PM|     0|  2.721701E7|        10|
|              Kerala|6:00 PM|     0| 2.7049431E7|        10|
|              Kerala|6:00 PM|     0| 2.6896792E7|        10|
|       

We can also use math functions such as F.exp:

In [28]:
casesWithExpConfirmed = cases.withColumn("ExpConfirmed", F.exp("Deaths"))
casesWithExpConfirmed.show()

+--------------------+-------+------+------------+------------+
|State/UnionTerritory|   Time|Deaths|TotalSamples|ExpConfirmed|
+--------------------+-------+------+------------+------------+
|              Kerala|6:00 PM|     0| 2.8745545E7|         1.0|
|              Kerala|6:00 PM|     0| 2.8612776E7|         1.0|
|              Kerala|6:00 PM|     0| 2.8514136E7|         1.0|
|              Kerala|6:00 PM|     0|  2.837994E7|         1.0|
|              Kerala|6:00 PM|     0| 2.8227419E7|         1.0|
|              Kerala|6:00 PM|     0| 2.8075527E7|         1.0|
|              Kerala|6:00 PM|     0| 2.7912151E7|         1.0|
|              Kerala|6:00 PM|     0| 2.7715059E7|         1.0|
|              Kerala|6:00 PM|     0| 2.7515603E7|         1.0|
|              Kerala|6:00 PM|     0|   2.73877E7|         1.0|
|              Kerala|6:00 PM|     0|  2.721701E7|         1.0|
|              Kerala|6:00 PM|     0| 2.7049431E7|         1.0|
|              Kerala|6:00 PM|     0| 2.

#### [2] Using Spark UDFs

Sometimes we want to do more complicated things to one or more columns. This can be thought of as a map operation over the rows of a PySpark DataFrame. While Spark SQL functions cover many column-creation use cases, I use a Spark UDF whenever I need more mature Python functionality.

To use Spark UDFs, we use the F.udf function to convert a regular Python function into a Spark UDF. We also need to specify the return type of the function. In this example the return type is StringType().

In [29]:
import pyspark.sql.functions as F
from pyspark.sql.types import *

def casesHighLow(Deaths):
    if Deaths < 10: 
        return 'low'
    else:
        return 'high'
    
#convert to a UDF Function by passing in the function and return type of function
casesHighLowUDF = F.udf(casesHighLow, StringType())
CasesWithHighLow = cases.withColumn("HighLow", casesHighLowUDF("Deaths"))
CasesWithHighLow.show()

+--------------------+-------+------+------------+-------+
|State/UnionTerritory|   Time|Deaths|TotalSamples|HighLow|
+--------------------+-------+------+------------+-------+
|              Kerala|6:00 PM|     0| 2.8745545E7|    low|
|              Kerala|6:00 PM|     0| 2.8612776E7|    low|
|              Kerala|6:00 PM|     0| 2.8514136E7|    low|
|              Kerala|6:00 PM|     0|  2.837994E7|    low|
|              Kerala|6:00 PM|     0| 2.8227419E7|    low|
|              Kerala|6:00 PM|     0| 2.8075527E7|    low|
|              Kerala|6:00 PM|     0| 2.7912151E7|    low|
|              Kerala|6:00 PM|     0| 2.7715059E7|    low|
|              Kerala|6:00 PM|     0| 2.7515603E7|    low|
|              Kerala|6:00 PM|     0|   2.73877E7|    low|
|              Kerala|6:00 PM|     0|  2.721701E7|    low|
|              Kerala|6:00 PM|     0| 2.7049431E7|    low|
|              Kerala|6:00 PM|     0| 2.6896792E7|    low|
|              Kerala|6:00 PM|     0| 2.6733694E7|    lo


#### [3] Using Pandas UDF

This allows you to use pandas functionality with Spark. I generally use it when I have to run a groupBy operation on a Spark DataFrame or whenever I need to create rolling features.

We use it through the F.pandas_udf decorator. **With a grouped-map UDF, the function receives each group as a pandas DataFrame and must return a pandas DataFrame.**

The only complexity here is that we have to provide a schema for the output DataFrame. We can start from the original schema of the DataFrame and add the new column to create the outSchema.

In [30]:
cases.printSchema()

root
 |-- State/UnionTerritory: string (nullable = true)
 |-- Time: string (nullable = true)
 |-- Deaths: integer (nullable = true)
 |-- TotalSamples: double (nullable = true)



In [31]:
from pyspark.sql.types import IntegerType, StringType, DoubleType, BooleanType
from pyspark.sql.types import StructType, StructField

# Declare the schema for the output of our function

outSchema = StructType([StructField('State/UnionTerritory',StringType(),True),
                        StructField('Time',StringType(),True),
                        StructField('Deaths',IntegerType(),True),
                        StructField('TotalSamples',DoubleType(),True),
                        StructField('normalized_deaths',DoubleType(),True)
                       ])
# decorate our function with pandas_udf decorator
@F.pandas_udf(outSchema, F.PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    v = pdf.Deaths
    v = v - v.mean()
    pdf['normalized_deaths'] = v
    return pdf

confirmed_groupwise_normalization = cases.groupby("State/UnionTerritory").apply(subtract_mean)

confirmed_groupwise_normalization.limit(5).toPandas()

                                                                                

Unnamed: 0,State/UnionTerritory,Time,Deaths,TotalSamples,normalized_deaths
0,Chhattisgarh,6:00 PM,0,11762041.0,-4038.982387
1,Chhattisgarh,6:00 PM,0,11720007.0,-4038.982387
2,Chhattisgarh,6:00 PM,0,11692078.0,-4038.982387
3,Chhattisgarh,6:00 PM,0,11666597.0,-4038.982387
4,Chhattisgarh,6:00 PM,0,11624064.0,-4038.982387


### 4. Spark Window Functions

We will simply look at some of the most important and useful window functions available.

In [35]:
# Change the path with your google bucket path
state_date_time = spark.read.load("gs://dataproc-staging-asia-east2-441991837520-q2hvd2t0/notebooks/jupyter/StatewiseTestingDetails.csv",
                          format="csv", 
                          sep=",", 
                          inferSchema="true", 
                          header="true")

state_date_time.show()

+----------+--------------------+------------+--------+--------+
|      Date|               State|TotalSamples|Negative|Positive|
+----------+--------------------+------------+--------+--------+
|2020-04-17|Andaman and Nicob...|      1403.0|    1210|    12.0|
|2020-04-24|Andaman and Nicob...|      2679.0|    null|    27.0|
|2020-04-27|Andaman and Nicob...|      2848.0|    null|    33.0|
|2020-05-01|Andaman and Nicob...|      3754.0|    null|    33.0|
|2020-05-16|Andaman and Nicob...|      6677.0|    null|    33.0|
|2020-05-19|Andaman and Nicob...|      6965.0|    null|    33.0|
|2020-05-20|Andaman and Nicob...|      7082.0|    null|    33.0|
|2020-05-21|Andaman and Nicob...|      7167.0|    null|    33.0|
|2020-05-22|Andaman and Nicob...|      7263.0|    null|    33.0|
|2020-05-23|Andaman and Nicob...|      7327.0|    null|    33.0|
|2020-05-24|Andaman and Nicob...|      7327.0|    null|    33.0|
|2020-05-25|Andaman and Nicob...|      7363.0|    null|    33.0|
|2020-05-26|Andaman and N

#### Ranking

You can get the rank as well as the dense_rank within a group using window functions. For example, you may want a column in the cases table that ranks each record within its state by number of deaths. We can do this by:

In [36]:
from pyspark.sql.window import Window
windowSpec = Window().partitionBy(['State/UnionTerritory']).orderBy(F.desc('Deaths'))
cases.withColumn("rank",F.rank().over(windowSpec)).show()

+--------------------+-------+------+------------+----+
|State/UnionTerritory|   Time|Deaths|TotalSamples|rank|
+--------------------+-------+------+------------+----+
|        Chhattisgarh|8:00 AM| 13544| 1.1762041E7|   1|
|        Chhattisgarh|8:00 AM| 13544| 1.1720007E7|   1|
|        Chhattisgarh|8:00 AM| 13544| 1.1692078E7|   1|
|        Chhattisgarh|8:00 AM| 13544| 1.1666597E7|   1|
|        Chhattisgarh|8:00 AM| 13544| 1.1624064E7|   1|
|        Chhattisgarh|8:00 AM| 13544| 1.1580254E7|   1|
|        Chhattisgarh|8:00 AM| 13544| 1.1535855E7|   1|
|        Chhattisgarh|8:00 AM| 13544| 1.1493309E7|   1|
|        Chhattisgarh|8:00 AM| 13544| 1.1452754E7|   1|
|        Chhattisgarh|8:00 AM| 13544| 1.1416645E7|   1|
|        Chhattisgarh|8:00 AM| 13544| 1.1394233E7|   1|
|        Chhattisgarh|8:00 AM| 13544| 1.1355589E7|   1|
|        Chhattisgarh|8:00 AM| 13544| 1.1312875E7|   1|
|        Chhattisgarh|8:00 AM| 13544| 1.1272374E7|   1|
|        Chhattisgarh|8:00 AM| 13544| 1.1236166E

                                                                                

### 5. Close Spark Instance

In [43]:
spark.stop()