#### PySpark fill() & fillna() in Azure Databricks

###### Gentle reminder:
In Databricks,
  - the SparkSession is made available as spark
  - the SparkContext is made available as sc

If you want to create them manually, use the code below.

In [0]:
from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

##### 1. Create a PySpark DataFrame manually

In [0]:
data = [
    ("Duel in the Sun",None,None,7.0),
    ("Tom Jones",None,None,7.0),
    ("Oliver!","Musical","Sony Pictures",7.5),
    ("To Kill A Mockingbird",None,"Universal",8.4),
    ("Tora, Tora, Tora",None,None,None)
]

df = spark.createDataFrame(data, schema=["title", "genre", "distributor", "imdb_rating"])
df.printSchema()
df.show(5, truncate=False)

##### 2. Create a PySpark DataFrame by reading files

In [0]:
# Replace "file_path" with the location of the source file you downloaded.

df_2 = spark.read.format("csv").option("inferSchema", True).option("header", True).load("file_path")
df_2.printSchema()

##### Note: The examples below use the manually created DataFrame

##### 1. Replacing the null values across the entire DataFrame using fill() & fillna()

In [0]:
# Replace nulls in the numeric columns only (string columns are left untouched)
df.na.fill(0).show()
df.fillna(0).show()

# Replace nulls in the string columns only (numeric columns are left untouched)
df.na.fill("unknown").show()
df.fillna("unknown").show()

##### 2. Replacing the null values of a selected column using fill() & fillna()

In [0]:
# Change the null values of the 'genre' column to 'unknown-genre'
df.na.fill(value="unknown-genre", subset=["genre"]).show()
df.fillna(value="unknown-genre", subset=["genre"]).show()

# Change the null values of the 'distributor' column to 'unknown-distributor'
df.na.fill(value="unknown-distributor", subset=["distributor"]).show()
df.fillna(value="unknown-distributor", subset=["distributor"]).show()

##### 3. Replacing the null values of selected columns with different values

In [0]:
# Replace the nulls of selected columns, using a different value for each column
df.na.fill({"genre": "unknown-genre", "distributor": "unknown-distributor"}).show()

##### 4. Replacing the null values with an aggregate value

In [0]:
from pyspark.sql.functions import avg

avg_rating = df.select(avg("imdb_rating").alias("avg_rating")).collect()[0]["avg_rating"]
print(f"The average IMDB rating is: {avg_rating}")

df.fillna(value=avg_rating, subset=["imdb_rating"]).show()
# Note: a string value cannot replace nulls in a numeric column;
# fill() and fillna() skip any column whose data type does not match the value.