# PySpark_Tutorial2

We will cover the following topics:

* PySpark DataFrame
* Reading the Dataset
* Verifying the Datatypes of the Columns (schema)
* Selecting Columns and Indexing
* Checking the describe option, similar to pandas
* Adding columns
* Dropping columns
* Renaming the columns
* Dropping Rows
* Various parameters in Dropping Functionalities
* Handling the missing values by Mean, Median, Mode


In [1]:
# Build the PySpark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Dataframe").getOrCreate()  # appName can be any label; "Dataframe" is used here
spark

In [2]:
# Read the DataFrame from a CSV file
df = spark.read.option('header','True').csv("top_4000_movies_data.csv")
df

DataFrame[Release Date: string, Movie Title: string, Production Budget: string, Domestic Gross: string, Worldwide Gross: string]

In [3]:
#Checking the Schema
df.printSchema()

root
 |-- Release Date: string (nullable = true)
 |-- Movie Title: string (nullable = true)
 |-- Production Budget: string (nullable = true)
 |-- Domestic Gross: string (nullable = true)
 |-- Worldwide Gross: string (nullable = true)



As you can see, by default every column is read as a string. To have Spark infer each column's type from the data, pass inferSchema=True.

In [4]:
df = spark.read.option("header","True").csv("top_4000_movies_data.csv",inferSchema=True)
df

DataFrame[Release Date: string, Movie Title: string, Production Budget: int, Domestic Gross: int, Worldwide Gross: bigint]

In [5]:
df.printSchema()

root
 |-- Release Date: string (nullable = true)
 |-- Movie Title: string (nullable = true)
 |-- Production Budget: integer (nullable = true)
 |-- Domestic Gross: integer (nullable = true)
 |-- Worldwide Gross: long (nullable = true)



nullable = true means the column is allowed to contain null values, not that it necessarily contains any.
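To build intuition for why inferSchema reports Production Budget as integer but Worldwide Gross as long, here is a minimal plain-Python sketch of per-column type inference. It is a simplified model, not Spark's actual code: the function name `infer_type` is hypothetical, and the real reader also considers other types such as double, boolean and timestamp.

```python
# Simplified model of column type inference (illustrative only).
INT_MIN, INT_MAX = -2**31, 2**31 - 1      # range of a 32-bit integer
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1    # range of a 64-bit long


def infer_type(values):
    """Return the narrowest type name that fits every non-missing value."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return "string"               # nothing to infer from
    try:
        numbers = [int(v) for v in present]
    except ValueError:
        return "string"               # at least one value is not a whole number
    if all(INT_MIN <= n <= INT_MAX for n in numbers):
        return "integer"
    if all(LONG_MIN <= n <= LONG_MAX for n in numbers):
        return "long"
    return "string"


budget = ["400000000", "379000000", "365000000"]
worldwide = ["2797800564", "1045713802", None]   # None models a missing cell
dates = ["4/23/2019", "5/20/2011"]

print(infer_type(budget))     # integer
print(infer_type(worldwide))  # long, because 2797800564 > 2147483647
print(infer_type(dates))      # string
```

A missing cell does not change the inferred type in this model; it only means the column has to be nullable.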

In [28]:
# Another way to read the dataset
df1 = spark.read.csv("top_4000_movies_data.csv",header=True,inferSchema=True)
df1.show()

+------------+--------------------+-----------------+--------------+---------------+
|Release Date|         Movie Title|Production Budget|Domestic Gross|Worldwide Gross|
+------------+--------------------+-----------------+--------------+---------------+
|   4/23/2019|   Avengers: Endgame|        400000000|     858373000|     2797800564|
|   5/20/2011|Pirates of the Ca...|        379000000|     241071802|     1045713802|
|   4/22/2015|Avengers: Age of ...|        365000000|     459005868|     1395316979|
|  12/16/2015|Star Wars Ep. VII...|        306000000|     936662225|     2064615817|
|   4/25/2018|Avengers: Infinit...|        300000000|     678815482|     2044540523|
|   5/24/2007|Pirates of the Ca...|        300000000|     309420425|      960996492|
|  11/13/2017|      Justice League|        300000000|     229024295|      655945209|
|   10/6/2015|             Spectre|        300000000|     200074175|      879500760|
|  12/18/2019|Star Wars: The Ri...|        275000000|     5152025

In [29]:
df1.printSchema()

root
 |-- Release Date: string (nullable = true)
 |-- Movie Title: string (nullable = true)
 |-- Production Budget: integer (nullable = true)
 |-- Domestic Gross: integer (nullable = true)
 |-- Worldwide Gross: long (nullable = true)



In [30]:
type(df1)  # df1 is a PySpark DataFrame object

pyspark.sql.dataframe.DataFrame

In [31]:
# Get the column names
df1.columns

['Release Date',
 'Movie Title',
 'Production Budget',
 'Domestic Gross',
 'Worldwide Gross']

In [32]:
# Get the first n rows as a list of Row objects, e.g. 5 records
df1.head(5)

[Row(Release Date='4/23/2019', Movie Title='Avengers: Endgame', Production Budget=400000000, Domestic Gross=858373000, Worldwide Gross=2797800564),
 Row(Release Date='5/20/2011', Movie Title='Pirates of the Caribbean: On Stranger Tides', Production Budget=379000000, Domestic Gross=241071802, Worldwide Gross=1045713802),
 Row(Release Date='4/22/2015', Movie Title='Avengers: Age of Ultron', Production Budget=365000000, Domestic Gross=459005868, Worldwide Gross=1395316979),
 Row(Release Date='12/16/2015', Movie Title='Star Wars Ep. VII: The Force Awakens', Production Budget=306000000, Domestic Gross=936662225, Worldwide Gross=2064615817),
 Row(Release Date='4/25/2018', Movie Title='Avengers: Infinity War', Production Budget=300000000, Domestic Gross=678815482, Worldwide Gross=2044540523)]

In [33]:
# Select a single column from the DataFrame
df1.select("Movie Title").show()

# Separate the names with ',' to select more than one column
df1.select("Release Date","Production Budget").show()

# Or pass the column names as a list
df1.select(["Movie Title","Release Date","Production Budget"]).show()

+--------------------+
|         Movie Title|
+--------------------+
|   Avengers: Endgame|
|Pirates of the Ca...|
|Avengers: Age of ...|
|Star Wars Ep. VII...|
|Avengers: Infinit...|
|Pirates of the Ca...|
|      Justice League|
|             Spectre|
|Star Wars: The Ri...|
|Solo: A Star Wars...|
|         John Carter|
|Batman v Superman...|
|Star Wars Ep. VII...|
|       The Lion King|
|             Tangled|
|        Spider-Man 3|
|Captain America: ...|
|Harry Potter and ...|
|The Hobbit: The D...|
|The Hobbit: The B...|
+--------------------+
only showing top 20 rows

+------------+-----------------+
|Release Date|Production Budget|
+------------+-----------------+
|   4/23/2019|        400000000|
|   5/20/2011|        379000000|
|   4/22/2015|        365000000|
|  12/16/2015|        306000000|
|   4/25/2018|        300000000|
|   5/24/2007|        300000000|
|  11/13/2017|        300000000|
|   10/6/2015|        300000000|
|  12/18/2019|        275000000|
|   5/23/2018|        2750

In [34]:
# Check Datatypes
df1.dtypes

[('Release Date', 'string'),
 ('Movie Title', 'string'),
 ('Production Budget', 'int'),
 ('Domestic Gross', 'int'),
 ('Worldwide Gross', 'bigint')]

In [35]:
# Get summary statistics, similar to pandas describe()
df1.describe().show()

+-------+------------+-------------------+--------------------+-------------------+-------------------+
|summary|Release Date|        Movie Title|   Production Budget|     Domestic Gross|    Worldwide Gross|
+-------+------------+-------------------+--------------------+-------------------+-------------------+
|  count|        3979|               4000|                4000|               4000|               4000|
|   mean|        null|  1032.909090909091|    4.693966688675E7|   5.896280106275E7|    1.33327273368E8|
| stddev|        null|  928.2740386917491|4.5202281986870565E7|7.898698033346803E7|2.101515489005335E8|
|    min|    1/1/1970|10 Cloverfield Lane|             9500000|                  0|                  0|
|    max|    9/9/2016|     長江七號 (CJ7)|           400000000|          936662225|         2845899541|
+-------+------------+-------------------+--------------------+-------------------+-------------------+
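The summary above can be mimicked in plain Python to see what each row means. This is a sketch under two assumptions: missing values are modelled as `None` and skipped, and stddev is the sample standard deviation. The helper name `describe_column` and the toy data are hypothetical. Skipping missing values would explain why the count for Release Date (3979) is lower than for the other columns (4000).

```python
import statistics


def describe_column(values):
    """Compute count, mean, sample stddev, min and max over non-missing values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "stddev": statistics.stdev(present),   # sample standard deviation (n - 1)
        "min": min(present),
        "max": max(present),
    }


# Toy column: three budgets from the table plus one missing value
sample = [400000000, 379000000, None, 365000000]
summary = describe_column(sample)
print(summary["count"])  # 3, the missing value is not counted
print(summary["min"], summary["max"])
```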



In [36]:
# Add a column to the DataFrame: worldwide gross as a percentage of the production budget
df1 = df1.withColumn("Movie Revenue %", df1["Worldwide Gross"] / df1["Production Budget"] * 100)
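As a quick check of the formula used for the new column, the same arithmetic can be worked by hand for one row. The function name below is hypothetical; the numbers are the Avengers: Endgame figures from the table.

```python
def gross_as_percent_of_budget(worldwide_gross, production_budget):
    """Worldwide gross expressed as a percentage of the production budget."""
    return worldwide_gross / production_budget * 100


endgame = gross_as_percent_of_budget(2797800564, 400000000)
print(round(endgame, 6))  # 699.450141
```

A value of about 699 means the film grossed roughly seven times its budget. It is a ratio of gross to budget, not a profit margin.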

In [37]:
df1.show()

+------------+--------------------+-----------------+--------------+---------------+------------------+
|Release Date|         Movie Title|Production Budget|Domestic Gross|Worldwide Gross|   Movie Revenue %|
+------------+--------------------+-----------------+--------------+---------------+------------------+
|   4/23/2019|   Avengers: Endgame|        400000000|     858373000|     2797800564|        699.450141|
|   5/20/2011|Pirates of the Ca...|        379000000|     241071802|     1045713802| 275.9139319261214|
|   4/22/2015|Avengers: Age of ...|        365000000|     459005868|     1395316979|382.27862438356163|
|  12/16/2015|Star Wars Ep. VII...|        306000000|     936662225|     2064615817| 674.7110513071896|
|   4/25/2018|Avengers: Infinit...|        300000000|     678815482|     2044540523| 681.5135076666667|
|   5/24/2007|Pirates of the Ca...|        300000000|     309420425|      960996492|        320.332164|
|  11/13/2017|      Justice League|        300000000|     229024

In [38]:
# Rename an existing column
df1 = df1.withColumnRenamed("Movie Revenue %","Revenue % wise-growth")


In [39]:
df1.show()

+------------+--------------------+-----------------+--------------+---------------+---------------------+
|Release Date|         Movie Title|Production Budget|Domestic Gross|Worldwide Gross|Revenue % wise-growth|
+------------+--------------------+-----------------+--------------+---------------+---------------------+
|   4/23/2019|   Avengers: Endgame|        400000000|     858373000|     2797800564|           699.450141|
|   5/20/2011|Pirates of the Ca...|        379000000|     241071802|     1045713802|    275.9139319261214|
|   4/22/2015|Avengers: Age of ...|        365000000|     459005868|     1395316979|   382.27862438356163|
|  12/16/2015|Star Wars Ep. VII...|        306000000|     936662225|     2064615817|    674.7110513071896|
|   4/25/2018|Avengers: Infinit...|        300000000|     678815482|     2044540523|    681.5135076666667|
|   5/24/2007|Pirates of the Ca...|        300000000|     309420425|      960996492|           320.332164|
|  11/13/2017|      Justice League|  

In [40]:
# Drop a column from the DataFrame
df1 = df1.drop("Revenue % wise-growth")

In [41]:
df1.show()

+------------+--------------------+-----------------+--------------+---------------+
|Release Date|         Movie Title|Production Budget|Domestic Gross|Worldwide Gross|
+------------+--------------------+-----------------+--------------+---------------+
|   4/23/2019|   Avengers: Endgame|        400000000|     858373000|     2797800564|
|   5/20/2011|Pirates of the Ca...|        379000000|     241071802|     1045713802|
|   4/22/2015|Avengers: Age of ...|        365000000|     459005868|     1395316979|
|  12/16/2015|Star Wars Ep. VII...|        306000000|     936662225|     2064615817|
|   4/25/2018|Avengers: Infinit...|        300000000|     678815482|     2044540523|
|   5/24/2007|Pirates of the Ca...|        300000000|     309420425|      960996492|
|  11/13/2017|      Justice League|        300000000|     229024295|      655945209|
|   10/6/2015|             Spectre|        300000000|     200074175|      879500760|
|  12/18/2019|Star Wars: The Ri...|        275000000|     5152025

In [42]:
# Rename a column
df1 = df1.withColumnRenamed("Movie Title","Movie Name")
df1.show()

+------------+--------------------+-----------------+--------------+---------------+
|Release Date|          Movie Name|Production Budget|Domestic Gross|Worldwide Gross|
+------------+--------------------+-----------------+--------------+---------------+
|   4/23/2019|   Avengers: Endgame|        400000000|     858373000|     2797800564|
|   5/20/2011|Pirates of the Ca...|        379000000|     241071802|     1045713802|
|   4/22/2015|Avengers: Age of ...|        365000000|     459005868|     1395316979|
|  12/16/2015|Star Wars Ep. VII...|        306000000|     936662225|     2064615817|
|   4/25/2018|Avengers: Infinit...|        300000000|     678815482|     2044540523|
|   5/24/2007|Pirates of the Ca...|        300000000|     309420425|      960996492|
|  11/13/2017|      Justice League|        300000000|     229024295|      655945209|
|   10/6/2015|             Spectre|        300000000|     200074175|      879500760|
|  12/18/2019|Star Wars: The Ri...|        275000000|     5152025

In [43]:
# Drop every row that contains a null value in any column
df1.na.drop().show()

+------------+--------------------+-----------------+--------------+---------------+
|Release Date|          Movie Name|Production Budget|Domestic Gross|Worldwide Gross|
+------------+--------------------+-----------------+--------------+---------------+
|   4/23/2019|   Avengers: Endgame|        400000000|     858373000|     2797800564|
|   5/20/2011|Pirates of the Ca...|        379000000|     241071802|     1045713802|
|   4/22/2015|Avengers: Age of ...|        365000000|     459005868|     1395316979|
|  12/16/2015|Star Wars Ep. VII...|        306000000|     936662225|     2064615817|
|   4/25/2018|Avengers: Infinit...|        300000000|     678815482|     2044540523|
|   5/24/2007|Pirates of the Ca...|        300000000|     309420425|      960996492|
|  11/13/2017|      Justice League|        300000000|     229024295|      655945209|
|   10/6/2015|             Spectre|        300000000|     200074175|      879500760|
|  12/18/2019|Star Wars: The Ri...|        275000000|     5152025

In [44]:
df1.printSchema()

root
 |-- Release Date: string (nullable = true)
 |-- Movie Name: string (nullable = true)
 |-- Production Budget: integer (nullable = true)
 |-- Domestic Gross: integer (nullable = true)
 |-- Worldwide Gross: long (nullable = true)



In [45]:
# Drop with how="any" (the default): a row is dropped if any of its columns is null
df1.na.drop(how="any").show()

+------------+--------------------+-----------------+--------------+---------------+
|Release Date|          Movie Name|Production Budget|Domestic Gross|Worldwide Gross|
+------------+--------------------+-----------------+--------------+---------------+
|   4/23/2019|   Avengers: Endgame|        400000000|     858373000|     2797800564|
|   5/20/2011|Pirates of the Ca...|        379000000|     241071802|     1045713802|
|   4/22/2015|Avengers: Age of ...|        365000000|     459005868|     1395316979|
|  12/16/2015|Star Wars Ep. VII...|        306000000|     936662225|     2064615817|
|   4/25/2018|Avengers: Infinit...|        300000000|     678815482|     2044540523|
|   5/24/2007|Pirates of the Ca...|        300000000|     309420425|      960996492|
|  11/13/2017|      Justice League|        300000000|     229024295|      655945209|
|   10/6/2015|             Spectre|        300000000|     200074175|      879500760|
|  12/18/2019|Star Wars: The Ri...|        275000000|     5152025

In [47]:
# Drop using a threshold: thresh=3 keeps rows that have at least 3 non-null values and drops the rest.
# When thresh is set, it takes precedence over how.
df1.na.drop(how='any',thresh=3).show()

+------------+--------------------+-----------------+--------------+---------------+
|Release Date|          Movie Name|Production Budget|Domestic Gross|Worldwide Gross|
+------------+--------------------+-----------------+--------------+---------------+
|   4/23/2019|   Avengers: Endgame|        400000000|     858373000|     2797800564|
|   5/20/2011|Pirates of the Ca...|        379000000|     241071802|     1045713802|
|   4/22/2015|Avengers: Age of ...|        365000000|     459005868|     1395316979|
|  12/16/2015|Star Wars Ep. VII...|        306000000|     936662225|     2064615817|
|   4/25/2018|Avengers: Infinit...|        300000000|     678815482|     2044540523|
|   5/24/2007|Pirates of the Ca...|        300000000|     309420425|      960996492|
|  11/13/2017|      Justice League|        300000000|     229024295|      655945209|
|   10/6/2015|             Spectre|        300000000|     200074175|      879500760|
|  12/18/2019|Star Wars: The Ri...|        275000000|     5152025

In [48]:
# Drop rows with null values only in specific columns, using subset
df1.na.drop(how='any',subset=["Movie Name"]).show()


+------------+--------------------+-----------------+--------------+---------------+
|Release Date|          Movie Name|Production Budget|Domestic Gross|Worldwide Gross|
+------------+--------------------+-----------------+--------------+---------------+
|   4/23/2019|   Avengers: Endgame|        400000000|     858373000|     2797800564|
|   5/20/2011|Pirates of the Ca...|        379000000|     241071802|     1045713802|
|   4/22/2015|Avengers: Age of ...|        365000000|     459005868|     1395316979|
|  12/16/2015|Star Wars Ep. VII...|        306000000|     936662225|     2064615817|
|   4/25/2018|Avengers: Infinit...|        300000000|     678815482|     2044540523|
|   5/24/2007|Pirates of the Ca...|        300000000|     309420425|      960996492|
|  11/13/2017|      Justice League|        300000000|     229024295|      655945209|
|   10/6/2015|             Spectre|        300000000|     200074175|      879500760|
|  12/18/2019|Star Wars: The Ri...|        275000000|     5152025
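The three parameters used above (how, thresh and subset) interact in a way that is easier to see on a tiny example. The following plain-Python sketch models the row-dropping rules on a list of dicts. It is not Spark code: the function `drop_rows` and the toy rows are hypothetical, and `None` stands in for a null.

```python
def drop_rows(rows, how="any", thresh=None, subset=None):
    """Keep or drop each row based on how many non-null values it has."""
    kept = []
    for row in rows:
        cols = subset if subset is not None else list(row)
        non_null = sum(1 for c in cols if row[c] is not None)
        if thresh is not None:
            keep = non_null >= thresh       # thresh takes precedence over how
        elif how == "any":
            keep = non_null == len(cols)    # drop if any considered column is null
        elif how == "all":
            keep = non_null > 0             # drop only if every considered column is null
        else:
            raise ValueError("how must be 'any' or 'all'")
        if keep:
            kept.append(row)
    return kept


rows = [
    {"Movie Name": "A", "Release Date": "1/1/2000", "Production Budget": 10},
    {"Movie Name": None, "Release Date": "2/2/2001", "Production Budget": 20},
    {"Movie Name": None, "Release Date": None, "Production Budget": None},
]

print(len(drop_rows(rows)))                           # 1
print(len(drop_rows(rows, how="all")))                # 2
print(len(drop_rows(rows, thresh=2)))                 # 2
print(len(drop_rows(rows, subset=["Release Date"])))  # 2
```

With subset, only the listed columns are inspected, so a row with a missing Movie Name survives as long as its Release Date is present.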

In [49]:
# Fill the missing values
# Because the fill value is a string, only nulls in string columns are replaced; numeric columns are left unchanged.
df1.na.fill("Missing values").show()



+------------+--------------------+-----------------+--------------+---------------+
|Release Date|          Movie Name|Production Budget|Domestic Gross|Worldwide Gross|
+------------+--------------------+-----------------+--------------+---------------+
|   4/23/2019|   Avengers: Endgame|        400000000|     858373000|     2797800564|
|   5/20/2011|Pirates of the Ca...|        379000000|     241071802|     1045713802|
|   4/22/2015|Avengers: Age of ...|        365000000|     459005868|     1395316979|
|  12/16/2015|Star Wars Ep. VII...|        306000000|     936662225|     2064615817|
|   4/25/2018|Avengers: Infinit...|        300000000|     678815482|     2044540523|
|   5/24/2007|Pirates of the Ca...|        300000000|     309420425|      960996492|
|  11/13/2017|      Justice League|        300000000|     229024295|      655945209|
|   10/6/2015|             Spectre|        300000000|     200074175|      879500760|
|  12/18/2019|Star Wars: The Ri...|        275000000|     5152025

In [51]:
# Fill the missing values in a particular column
# Nulls in "Movie Name" are replaced with "missing values"; any string can be used as the fill value.
df1.na.fill("missing values","Movie Name").show()


+------------+--------------------+-----------------+--------------+---------------+
|Release Date|          Movie Name|Production Budget|Domestic Gross|Worldwide Gross|
+------------+--------------------+-----------------+--------------+---------------+
|   4/23/2019|   Avengers: Endgame|        400000000|     858373000|     2797800564|
|   5/20/2011|Pirates of the Ca...|        379000000|     241071802|     1045713802|
|   4/22/2015|Avengers: Age of ...|        365000000|     459005868|     1395316979|
|  12/16/2015|Star Wars Ep. VII...|        306000000|     936662225|     2064615817|
|   4/25/2018|Avengers: Infinit...|        300000000|     678815482|     2044540523|
|   5/24/2007|Pirates of the Ca...|        300000000|     309420425|      960996492|
|  11/13/2017|      Justice League|        300000000|     229024295|      655945209|
|   10/6/2015|             Spectre|        300000000|     200074175|      879500760|
|  12/18/2019|Star Wars: The Ri...|        275000000|     5152025

In [52]:
# Fill the missing values in multiple columns
# Nulls in the listed columns are replaced with "Missing Value".
df1.na.fill("Missing Value", ["Movie Name","Release Date"]).show()


+------------+--------------------+-----------------+--------------+---------------+
|Release Date|          Movie Name|Production Budget|Domestic Gross|Worldwide Gross|
+------------+--------------------+-----------------+--------------+---------------+
|   4/23/2019|   Avengers: Endgame|        400000000|     858373000|     2797800564|
|   5/20/2011|Pirates of the Ca...|        379000000|     241071802|     1045713802|
|   4/22/2015|Avengers: Age of ...|        365000000|     459005868|     1395316979|
|  12/16/2015|Star Wars Ep. VII...|        306000000|     936662225|     2064615817|
|   4/25/2018|Avengers: Infinit...|        300000000|     678815482|     2044540523|
|   5/24/2007|Pirates of the Ca...|        300000000|     309420425|      960996492|
|  11/13/2017|      Justice League|        300000000|     229024295|      655945209|
|   10/6/2015|             Spectre|        300000000|     200074175|      879500760|
|  12/18/2019|Star Wars: The Ri...|        275000000|     5152025
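One detail of the fill calls above is easy to miss: a string fill value only touches string columns, and a numeric fill value only touches numeric columns. The plain-Python sketch below models that rule. The `SCHEMA` dict, the function `fill_missing` and the toy row are hypothetical stand-ins, not Spark objects.

```python
# Hypothetical column types for the toy data
SCHEMA = {"Movie Name": str, "Release Date": str, "Production Budget": int}


def fill_missing(rows, value, subset=None):
    """Replace None with value, but only in columns whose type matches the value."""
    cols = subset if subset is not None else list(SCHEMA)
    targets = [c for c in cols if SCHEMA[c] is type(value)]
    return [
        {c: (value if c in targets and v is None else v) for c, v in row.items()}
        for row in rows
    ]


rows = [{"Movie Name": None, "Release Date": None, "Production Budget": None}]

print(fill_missing(rows, "Missing Value"))
# string columns are filled, Production Budget stays None
print(fill_missing(rows, "Missing Value", ["Movie Name"]))
# only Movie Name is filled
print(fill_missing(rows, 0))
# only Production Budget is filled
```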

In [56]:
# Fill null values with the column mean
# Imputer is an estimator that completes missing values using the mean, median or mode of the column they are in.
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=["Production Budget","Domestic Gross"],
    outputCols=["{}_inputed".format(c) for c in ["Production Budget","Domestic Gross"]]
).setStrategy("mean")  # the strategy can also be "median" or "mode" (mode is available in newer Spark releases)
imputer.fit(df1).transform(df1).show()

+------------+--------------------+-----------------+--------------+---------------+-------------------------+----------------------+
|Release Date|          Movie Name|Production Budget|Domestic Gross|Worldwide Gross|Production Budget_inputed|Domestic Gross_inputed|
+------------+--------------------+-----------------+--------------+---------------+-------------------------+----------------------+
|   4/23/2019|   Avengers: Endgame|        400000000|     858373000|     2797800564|                400000000|             858373000|
|   5/20/2011|Pirates of the Ca...|        379000000|     241071802|     1045713802|                379000000|             241071802|
|   4/22/2015|Avengers: Age of ...|        365000000|     459005868|     1395316979|                365000000|             459005868|
|  12/16/2015|Star Wars Ep. VII...|        306000000|     936662225|     2064615817|                306000000|             936662225|
|   4/25/2018|Avengers: Infinit...|        300000000|     6788
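To make the three imputation strategies concrete, here is a minimal plain-Python sketch of what filling with the mean, median or mode does to one column. It is an illustration rather than the Imputer's implementation: the function `impute` and the toy values are hypothetical, and on large data Spark may compute the median approximately instead of exactly.

```python
import statistics

# Map each strategy name to the statistic used as the replacement value
STRATEGIES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
}


def impute(values, strategy="mean"):
    """Replace None with the chosen statistic of the non-missing values."""
    present = [v for v in values if v is not None]
    replacement = STRATEGIES[strategy](present)
    return [replacement if v is None else v for v in values]


budgets = [300, 300, 275, None, 400]   # toy values with one missing entry

print(impute(budgets, "mean"))    # missing entry becomes 318.75
print(impute(budgets, "median"))  # missing entry becomes 300.0
print(impute(budgets, "mode"))    # missing entry becomes 300
```

The mean is pulled towards extreme values such as 400, while the median and mode are not, which is one reason to prefer them for skewed columns like budgets.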