####Data Transformation

At the start of the transformation process, the inferred data types and table schema were checked to ensure that all columns were correctly interpreted by Spark. This verification ensures numeric, string, and date fields are properly recognized, preventing errors during subsequent transformations and calculations.


In [0]:
# Reading bronze dataset
bronze_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/mnt/ADLSmount/bronze/movies_dataset/")

# Print the inferred schema (the rows are previewed in the next cell)
bronze_df.printSchema()


root
 |-- MOVIES: string (nullable = true)
 |-- YEAR: string (nullable = true)
 |-- GENRE: string (nullable = true)
 |-- RATING: string (nullable = true)
 |-- ONE-LINE: string (nullable = true)
 |-- STARS: string (nullable = true)
 |-- VOTES: string (nullable = true)
 |-- RunTime: string (nullable = true)
 |-- Gross: string (nullable = true)



In [0]:
bronze_df.show(10)

+--------------------+-------+--------------------+------+--------+-----+-----+-------+-----+
|              MOVIES|   YEAR|               GENRE|RATING|ONE-LINE|STARS|VOTES|RunTime|Gross|
+--------------------+-------+--------------------+------+--------+-----+-----+-------+-----+
|       Blood Red Sky| (2021)|                  \n|  NULL|    NULL| NULL| NULL|   NULL| NULL|
|              Action| Horror| Thriller        ...|   6.1|      \n| NULL| NULL|   NULL| NULL|
|A woman with a my...|     \n|                NULL|  NULL|    NULL| NULL| NULL|   NULL| NULL|
|           Director:|   NULL|                NULL|  NULL|    NULL| NULL| NULL|   NULL| NULL|
|     Peter Thorwarth|   NULL|                NULL|  NULL|    NULL| NULL| NULL|   NULL| NULL|
|                  | |   NULL|                NULL|  NULL|    NULL| NULL| NULL|   NULL| NULL|
|              Stars:|   NULL|                NULL|  NULL|    NULL| NULL| NULL|   NULL| NULL|
|     Peri Baumeister|       |                NULL|  NULL|  

The initial output showed that the data was not parsed correctly: single records were split across several rows, and most columns were NULL. To better understand the file's structure, the first 500 bytes of the raw file were previewed using print(dbutils.fs.head(raw_path, 500)). The preview shows comma-delimited records whose quoted fields contain embedded newlines, which the default CSV reader treats as record boundaries.

In [0]:
raw_path = "/mnt/ADLSmount/bronze/movies_dataset/movies.csv"

print(dbutils.fs.head(raw_path, 500))



[Truncated to first 500 bytes]
MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
Blood Red Sky,(2021),"
Action, Horror, Thriller            ",6.1,"
A woman with a mysterious illness is forced into action when a group of terrorists attempt to hijack a transatlantic overnight flight.","
    Director:
Peter Thorwarth
| 
    Stars:
Peri Baumeister, 
Carl Anton Koch, 
Alexander Scheer, 
Kais Setti
","21,062",121,
Masters of the Universe: Revelation,(2021– ),"
Animation, Action, Adventure            ",5.0,"
The war fo


A StructType schema was defined to explicitly specify the data types for each column, ensuring correct type handling instead of relying on Spark’s automatic inference. The option .option("multiLine", True) allows reading values that span multiple lines within quotes, and .option("escape", "\"") ensures that embedded double quotes are correctly interpreted, preventing parsing errors in the CSV.

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Define schema
schema = StructType([
    StructField("MOVIES", StringType(), False),  # not null
    StructField("YEAR", StringType(), True),
    StructField("GENRE", StringType(), True),
    StructField("RATING", DoubleType(), True),
    StructField("ONE-LINE", StringType(), True),
    StructField("STARS", StringType(), True),
    StructField("VOTES", StringType(), True),
    StructField("RunTime", IntegerType(), True),
    StructField("Gross", StringType(), True)
])

raw_path = "/mnt/ADLSmount/bronze/movies_dataset/movies.csv"

df_bronze = (
    spark.read
        .option("header", True)
        .option("multiLine", True)   # allows newlines inside quotes
        .option("escape", "\"")      # handle quotes properly
        .schema(schema)
        .csv(raw_path)
)

display(df_bronze.limit(10))


MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
Blood Red Sky,(2021),"Action, Horror, Thriller",6.1,A woman with a mysterious illness is forced into action when a group of terrorists attempt to hijack a transatlantic overnight flight.,"Director: Peter Thorwarth | Stars: Peri Baumeister, Carl Anton Koch, Alexander Scheer, Kais Setti",21062.0,121.0,
Masters of the Universe: Revelation,(2021– ),"Animation, Action, Adventure",5.0,The war for Eternia begins again in what may be the final battle between He-Man and Skeletor. A new animated series from writer-director Kevin Smith.,"Stars: Chris Wood, Sarah Michelle Gellar, Lena Headey, Mark Hamill",17870.0,25.0,
The Walking Dead,(2010–2022),"Drama, Horror, Thriller",8.2,Sheriff Deputy Rick Grimes wakes up from a coma to learn the world is in ruins and must lead a group of survivors to stay alive.,"Stars: Andrew Lincoln, Norman Reedus, Melissa McBride, Lauren Cohan",885805.0,44.0,
Rick and Morty,(2013– ),"Animation, Adventure, Comedy",9.2,An animated series that follows the exploits of a super scientist and his not-so-bright grandson.,"Stars: Justin Roiland, Chris Parnell, Spencer Grammer, Sarah Chalke",414849.0,23.0,
Army of Thieves,(2021),"Action, Crime, Horror",,"A prequel, set before the events of Army of the Dead, which focuses on German safecracker Ludwig Dieter leading a group of aspiring thieves on a top secret heist during the early stages of the zombie apocalypse.","Director: Matthias Schweighöfer | Stars: Matthias Schweighöfer, Nathalie Emmanuel, Ruby O. Fee, Stuart Martin",,,
Outer Banks,(2020– ),"Action, Crime, Drama",7.6,A group of teenagers from the wrong side of the tracks stumble upon a treasure map that unearths a long buried secret.,"Stars: Chase Stokes, Madelyn Cline, Madison Bailey, Jonathan Daviss",25858.0,50.0,
The Last Letter from Your Lover,(2021),"Drama, Romance",6.8,A pair of interwoven stories set in the past and present follow an ambitious journalist determined to solve the mystery of a forbidden love affair at the center of a trove of secret love letters from 1965.,"Director: Augustine Frizzell | Stars: Shailene Woodley, Joe Alwyn, Wendy Nottingham, Felicity Jones",5283.0,110.0,
Dexter,(2006–2013),"Crime, Drama, Mystery",8.6,"By day, mild-mannered Dexter is a blood-spatter analyst for the Miami police. But at night, he is a serial killer who only targets other murderers.","Stars: Michael C. Hall, Jennifer Carpenter, David Zayas, James Remar",665387.0,53.0,
Never Have I Ever,(2020– ),Comedy,7.9,"The complicated life of a modern-day first generation Indian American teenage girl, inspired by Mindy Kaling's own childhood.","Stars: Maitreyi Ramakrishnan, Poorna Jagannathan, Darren Barnet, John McEnroe",34530.0,30.0,
Virgin River,(2019– ),"Drama, Romance",7.4,"Seeking a fresh start, nurse practitioner Melinda Monroe moves from Los Angeles to a remote Northern California town and is surprised by what and who she finds.","Stars: Alexandra Breckenridge, Martin Henderson, Colin Lawrence, Tim Matheson",27279.0,44.0,
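
The effect of the multiLine option can be illustrated outside Spark. The following is a minimal plain-Python sketch, not the Spark reader itself: it runs the standard csv module on a shortened copy of the first record from the raw preview (the `raw` string and the variable names are illustrative assumptions) and compares naive line splitting with quote-aware parsing.

```python
import csv
import io

# Shortened copy of the first record from the raw preview; the quoted GENRE
# field starts with a newline, as in the real file.
raw = (
    'MOVIES,YEAR,GENRE,RATING\n'
    'Blood Red Sky,(2021),"\nAction, Horror, Thriller            ",6.1\n'
)

# Naive approach: treat every physical line as one record.
naive_rows = raw.strip("\n").split("\n")

# Quote-aware approach: newlines inside double quotes stay within the field.
parsed_rows = list(csv.reader(io.StringIO(raw, newline="")))

print(len(naive_rows))   # physical lines: header + 2 fragments of one record
print(len(parsed_rows))  # logical records: header + 1 record
print(parsed_rows[1][2].strip())
```

Line splitting yields three fragments for what is a header plus one record, which mirrors the broken rows in the first `show(10)` output. Quote-aware parsing keeps the record intact, although the field still carries surrounding whitespace that may need trimming.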


In [0]:
'''The MOVIES column was checked for null values since it is the primary identifier for each record. Using dropna confirmed that there are no nulls in this column, ensuring all movie records are valid for subsequent transformations.'''

df_cleaned = df_bronze.dropna(subset=["MOVIES"])

print("Before dropping nulls:", df_bronze.count())
print("After dropping nulls:", df_cleaned.count())


Before dropping nulls: 9999
After dropping nulls: 9999


In [0]:
'''The YEAR column was extracted using a regular expression to retain only the four-digit year and cast to integer. Null or invalid values were filled with 0 to ensure numeric consistency for analysis.'''

from pyspark.sql.functions import regexp_extract, col

df_cleaned = (
    df_cleaned  # continue from the null-checked DataFrame rather than df_bronze
    .withColumn("YEAR", regexp_extract(col("YEAR"), r"(\d{4})", 1).cast("int"))
    .fillna({"YEAR": 0}) 
)
df_cleaned.limit(5).display()


MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
Blood Red Sky,2021,"Action, Horror, Thriller",6.1,A woman with a mysterious illness is forced into action when a group of terrorists attempt to hijack a transatlantic overnight flight.,"Director: Peter Thorwarth | Stars: Peri Baumeister, Carl Anton Koch, Alexander Scheer, Kais Setti",21062.0,121.0,
Masters of the Universe: Revelation,2021,"Animation, Action, Adventure",5.0,The war for Eternia begins again in what may be the final battle between He-Man and Skeletor. A new animated series from writer-director Kevin Smith.,"Stars: Chris Wood, Sarah Michelle Gellar, Lena Headey, Mark Hamill",17870.0,25.0,
The Walking Dead,2010,"Drama, Horror, Thriller",8.2,Sheriff Deputy Rick Grimes wakes up from a coma to learn the world is in ruins and must lead a group of survivors to stay alive.,"Stars: Andrew Lincoln, Norman Reedus, Melissa McBride, Lauren Cohan",885805.0,44.0,
Rick and Morty,2013,"Animation, Adventure, Comedy",9.2,An animated series that follows the exploits of a super scientist and his not-so-bright grandson.,"Stars: Justin Roiland, Chris Parnell, Spencer Grammer, Sarah Chalke",414849.0,23.0,
Army of Thieves,2021,"Action, Crime, Horror",,"A prequel, set before the events of Army of the Dead, which focuses on German safecracker Ludwig Dieter leading a group of aspiring thieves on a top secret heist during the early stages of the zombie apocalypse.","Director: Matthias Schweighöfer | Stars: Matthias Schweighöfer, Nathalie Emmanuel, Ruby O. Fee, Stuart Martin",,,
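
The behaviour of the pattern `(\d{4})` on the YEAR formats seen in this dataset can be checked with a small plain-Python sketch. The helper name `extract_year` is hypothetical and uses Python's `re` module instead of Spark's `regexp_extract`. It is meant to mirror the cell above, where the first four-digit run is kept and anything without one becomes 0.

```python
import re

def extract_year(raw):
    """Return the first four-digit run in raw as an int, or 0 if none exists."""
    match = re.search(r"(\d{4})", raw or "")
    return int(match.group(1)) if match else 0

samples = ["(2021)", "(2010–2022)", "(2013– )", "", None]
for value in samples:
    print(repr(value), "->", extract_year(value))
```

For ranges such as `(2010–2022)` only the start year survives, so the end year of a series is lost. The fill value 0 is a sentinel rather than a real year, so analyses by year would likely need to filter it out.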


In [0]:
# Null values in the GENRE column were replaced with 'Unknown' to ensure all records have a valid genre for analysis.

df_cleaned = (
    df_cleaned
    .fillna({"GENRE": 'Unknown'}) 
)

df_cleaned.limit(5).display()

MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
Blood Red Sky,2021,"Action, Horror, Thriller",6.1,A woman with a mysterious illness is forced into action when a group of terrorists attempt to hijack a transatlantic overnight flight.,"Director: Peter Thorwarth | Stars: Peri Baumeister, Carl Anton Koch, Alexander Scheer, Kais Setti",21062.0,121.0,
Masters of the Universe: Revelation,2021,"Animation, Action, Adventure",5.0,The war for Eternia begins again in what may be the final battle between He-Man and Skeletor. A new animated series from writer-director Kevin Smith.,"Stars: Chris Wood, Sarah Michelle Gellar, Lena Headey, Mark Hamill",17870.0,25.0,
The Walking Dead,2010,"Drama, Horror, Thriller",8.2,Sheriff Deputy Rick Grimes wakes up from a coma to learn the world is in ruins and must lead a group of survivors to stay alive.,"Stars: Andrew Lincoln, Norman Reedus, Melissa McBride, Lauren Cohan",885805.0,44.0,
Rick and Morty,2013,"Animation, Adventure, Comedy",9.2,An animated series that follows the exploits of a super scientist and his not-so-bright grandson.,"Stars: Justin Roiland, Chris Parnell, Spencer Grammer, Sarah Chalke",414849.0,23.0,
Army of Thieves,2021,"Action, Crime, Horror",,"A prequel, set before the events of Army of the Dead, which focuses on German safecracker Ludwig Dieter leading a group of aspiring thieves on a top secret heist during the early stages of the zombie apocalypse.","Director: Matthias Schweighöfer | Stars: Matthias Schweighöfer, Nathalie Emmanuel, Ruby O. Fee, Stuart Martin",,,


In [0]:
'''Null values in the RATING column were replaced with the median rating to maintain a representative value and avoid skewing analyses.'''

# Calculating median
median_rating = df_cleaned.approxQuantile("RATING", [0.5], 0.01)[0]

# Fill null ratings with median
df_cleaned = df_cleaned.fillna({"RATING": median_rating})

df_cleaned.limit(5).display()


MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
Blood Red Sky,2021,"Action, Horror, Thriller",6.1,A woman with a mysterious illness is forced into action when a group of terrorists attempt to hijack a transatlantic overnight flight.,"Director: Peter Thorwarth | Stars: Peri Baumeister, Carl Anton Koch, Alexander Scheer, Kais Setti",21062.0,121.0,
Masters of the Universe: Revelation,2021,"Animation, Action, Adventure",5.0,The war for Eternia begins again in what may be the final battle between He-Man and Skeletor. A new animated series from writer-director Kevin Smith.,"Stars: Chris Wood, Sarah Michelle Gellar, Lena Headey, Mark Hamill",17870.0,25.0,
The Walking Dead,2010,"Drama, Horror, Thriller",8.2,Sheriff Deputy Rick Grimes wakes up from a coma to learn the world is in ruins and must lead a group of survivors to stay alive.,"Stars: Andrew Lincoln, Norman Reedus, Melissa McBride, Lauren Cohan",885805.0,44.0,
Rick and Morty,2013,"Animation, Adventure, Comedy",9.2,An animated series that follows the exploits of a super scientist and his not-so-bright grandson.,"Stars: Justin Roiland, Chris Parnell, Spencer Grammer, Sarah Chalke",414849.0,23.0,
Army of Thieves,2021,"Action, Crime, Horror",7.1,"A prequel, set before the events of Army of the Dead, which focuses on German safecracker Ludwig Dieter leading a group of aspiring thieves on a top secret heist during the early stages of the zombie apocalypse.","Director: Matthias Schweighöfer | Stars: Matthias Schweighöfer, Nathalie Emmanuel, Ruby O. Fee, Stuart Martin",,,
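
Median imputation itself can be sketched in plain Python. The ratings list below is a small made-up sample, not the dataset, and `statistics.median` computes an exact median. By contrast, `approxQuantile` with a relative error of 0.01 returns an approximation; per the Spark documentation, a relative error of 0 requests the exact quantile at a higher cost.

```python
from statistics import median

# Made-up sample: five known ratings and one missing value (None).
ratings = [6.1, 5.0, 8.2, 9.2, None, 7.6]

# The median is computed over the known values only.
known = [r for r in ratings if r is not None]
fill_value = median(known)

# Replace each missing rating with the median; known ratings are unchanged.
filled = [fill_value if r is None else r for r in ratings]

print(fill_value)
print(filled)
```

Because the median is taken from the non-null values, a few extreme ratings shift the fill value far less than they would shift a mean.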


In [0]:
'''Null values in the VOTES column were replaced with 0 to ensure all records have a numeric value, preventing errors in calculations and aggregations.'''

df_cleaned = (
    df_cleaned
    .fillna({"VOTES": 0}) 
)

In [0]:
'''Null values in the RunTime column were replaced with the median runtime to maintain a representative value and ensure consistency in analyses involving movie durations.'''

# Median of Runtime
median_runtime = df_cleaned.approxQuantile("RunTime", [0.5], 0.01)[0]
# print(median_runtime)

# Fill null runtimes with median
df_cleaned = df_cleaned.fillna({"RunTime": median_runtime})

df_cleaned.limit(5).display()

MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
Blood Red Sky,2021,"Action, Horror, Thriller",6.1,A woman with a mysterious illness is forced into action when a group of terrorists attempt to hijack a transatlantic overnight flight.,"Director: Peter Thorwarth | Stars: Peri Baumeister, Carl Anton Koch, Alexander Scheer, Kais Setti",21062,121,
Masters of the Universe: Revelation,2021,"Animation, Action, Adventure",5.0,The war for Eternia begins again in what may be the final battle between He-Man and Skeletor. A new animated series from writer-director Kevin Smith.,"Stars: Chris Wood, Sarah Michelle Gellar, Lena Headey, Mark Hamill",17870,25,
The Walking Dead,2010,"Drama, Horror, Thriller",8.2,Sheriff Deputy Rick Grimes wakes up from a coma to learn the world is in ruins and must lead a group of survivors to stay alive.,"Stars: Andrew Lincoln, Norman Reedus, Melissa McBride, Lauren Cohan",885805,44,
Rick and Morty,2013,"Animation, Adventure, Comedy",9.2,An animated series that follows the exploits of a super scientist and his not-so-bright grandson.,"Stars: Justin Roiland, Chris Parnell, Spencer Grammer, Sarah Chalke",414849,23,
Army of Thieves,2021,"Action, Crime, Horror",7.1,"A prequel, set before the events of Army of the Dead, which focuses on German safecracker Ludwig Dieter leading a group of aspiring thieves on a top secret heist during the early stages of the zombie apocalypse.","Director: Matthias Schweighöfer | Stars: Matthias Schweighöfer, Nathalie Emmanuel, Ruby O. Fee, Stuart Martin",0,60,


In [0]:
'''The Gross column contains many null values. To decide whether to keep or drop it, the percentage of nulls was calculated to assess its completeness and usefulness for analysis.'''

from pyspark.sql import functions as F

total_rows = df_cleaned.count()

# null count in Gross
null_count = df_cleaned.filter(F.col("Gross").isNull()).count()

print(f"Total rows: {total_rows}")
print(f"Nulls in Gross: {null_count}")
print(f"Percentage nulls: {(null_count/total_rows)*100:.2f}%")


Total rows: 9999
Nulls in Gross: 9539
Percentage nulls: 95.40%
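
The arithmetic behind the reported percentage can be reproduced with a short plain-Python sketch. The helper `null_percentage` is hypothetical, and the list is synthetic: it only reuses the two counts printed above (9539 nulls out of 9999 rows), with a placeholder string standing in for the non-null values.

```python
def null_percentage(values):
    """Return the share of None entries in values as a percentage."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values) * 100

# Synthetic column with the same null and row counts as the output above;
# "placeholder" stands in for the real non-null Gross values.
gross = [None] * 9539 + ["placeholder"] * (9999 - 9539)

pct = null_percentage(gross)
print(f"Percentage nulls: {pct:.2f}%")
```

With these counts only 460 rows carry a Gross value.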


In [0]:
'''Since 95.4% of the Gross column values were null, the column was dropped. Even if nulls were handled, the remaining data would be too sparse and unreliable for meaningful analysis.'''

df_silver = df_cleaned.drop("Gross")

df_silver.limit(5).display()

MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime
Blood Red Sky,2021,"Action, Horror, Thriller",6.1,A woman with a mysterious illness is forced into action when a group of terrorists attempt to hijack a transatlantic overnight flight.,"Director: Peter Thorwarth | Stars: Peri Baumeister, Carl Anton Koch, Alexander Scheer, Kais Setti",21062,121
Masters of the Universe: Revelation,2021,"Animation, Action, Adventure",5.0,The war for Eternia begins again in what may be the final battle between He-Man and Skeletor. A new animated series from writer-director Kevin Smith.,"Stars: Chris Wood, Sarah Michelle Gellar, Lena Headey, Mark Hamill",17870,25
The Walking Dead,2010,"Drama, Horror, Thriller",8.2,Sheriff Deputy Rick Grimes wakes up from a coma to learn the world is in ruins and must lead a group of survivors to stay alive.,"Stars: Andrew Lincoln, Norman Reedus, Melissa McBride, Lauren Cohan",885805,44
Rick and Morty,2013,"Animation, Adventure, Comedy",9.2,An animated series that follows the exploits of a super scientist and his not-so-bright grandson.,"Stars: Justin Roiland, Chris Parnell, Spencer Grammer, Sarah Chalke",414849,23
Army of Thieves,2021,"Action, Crime, Horror",7.1,"A prequel, set before the events of Army of the Dead, which focuses on German safecracker Ludwig Dieter leading a group of aspiring thieves on a top secret heist during the early stages of the zombie apocalypse.","Director: Matthias Schweighöfer | Stars: Matthias Schweighöfer, Nathalie Emmanuel, Ruby O. Fee, Stuart Martin",0,60


In [0]:
'''The VOTES column could not be directly cast to integer because it contained commas. A regular expression was used to remove all non-numeric characters, and the column was then cast to a long type for accurate numeric analysis.'''

from pyspark.sql.functions import regexp_replace, col

df_silver = df_silver.withColumn(
    "VOTES",
    regexp_replace(col("VOTES"), "[^0-9]", "").cast("long")  # remove everything except digits
)

df_silver.limit(5).display()


MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime
Blood Red Sky,2021,"Action, Horror, Thriller",6.1,A woman with a mysterious illness is forced into action when a group of terrorists attempt to hijack a transatlantic overnight flight.,"Director: Peter Thorwarth | Stars: Peri Baumeister, Carl Anton Koch, Alexander Scheer, Kais Setti",21062,121
Masters of the Universe: Revelation,2021,"Animation, Action, Adventure",5.0,The war for Eternia begins again in what may be the final battle between He-Man and Skeletor. A new animated series from writer-director Kevin Smith.,"Stars: Chris Wood, Sarah Michelle Gellar, Lena Headey, Mark Hamill",17870,25
The Walking Dead,2010,"Drama, Horror, Thriller",8.2,Sheriff Deputy Rick Grimes wakes up from a coma to learn the world is in ruins and must lead a group of survivors to stay alive.,"Stars: Andrew Lincoln, Norman Reedus, Melissa McBride, Lauren Cohan",885805,44
Rick and Morty,2013,"Animation, Adventure, Comedy",9.2,An animated series that follows the exploits of a super scientist and his not-so-bright grandson.,"Stars: Justin Roiland, Chris Parnell, Spencer Grammer, Sarah Chalke",414849,23
Army of Thieves,2021,"Action, Crime, Horror",7.1,"A prequel, set before the events of Army of the Dead, which focuses on German safecracker Ludwig Dieter leading a group of aspiring thieves on a top secret heist during the early stages of the zombie apocalypse.","Director: Matthias Schweighöfer | Stars: Matthias Schweighöfer, Nathalie Emmanuel, Ruby O. Fee, Stuart Martin",0,60
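
The digit-only cleanup can be tried on individual values with a plain-Python sketch. The helper `clean_votes` is hypothetical and uses `re.sub` in place of Spark's `regexp_replace`. It returns None when no digits remain, on the assumption that this matches Spark casting an empty string to null.

```python
import re

def clean_votes(raw):
    """Strip every non-digit character and convert the rest to an int."""
    if raw is None:
        return None
    digits = re.sub(r"[^0-9]", "", raw)
    # No digits left (for example "N/A"): treat as missing.
    return int(digits) if digits else None

for value in ["21,062", "885,805", "0", "N/A", "1.5K", None]:
    print(repr(value), "->", clean_votes(value))
```

The pattern removes every non-digit, including decimal points and suffixes, so a value such as `1.5K` would become 15. This is harmless for comma-grouped counts like those shown above, but any differently formatted values in the column would need a check first.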


In [0]:
'''The cleaned Silver DataFrame was written to the ADLS Silver layer in Delta format using overwrite mode.'''

ADLS_Silver_path = "/mnt/ADLSmount/silver/movies_dataset/"

df_silver.write.format("delta") \
    .mode("overwrite")\
    .save(ADLS_Silver_path)


#####Silver dataset is present in ADLS in Delta format (Parquet data files plus a _delta_log transaction log)
![Silver Layer in ADLS](https://adb-582130891499017.17.azuredatabricks.net/files/tables/ADLSsnipSilver.png)

In [0]:
# Delta table was created on the Silver layer using the data stored in ADLS

spark.sql("""
    CREATE TABLE IF NOT EXISTS delta_table_silver
    USING DELTA
    LOCATION '{}'
""".format(ADLS_Silver_path))

DataFrame[]

In [0]:
%sql
SELECT * FROM delta_table_silver LIMIT 5;


MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime
Blood Red Sky,2021,"Action, Horror, Thriller",6.1,A woman with a mysterious illness is forced into action when a group of terrorists attempt to hijack a transatlantic overnight flight.,"Director: Peter Thorwarth | Stars: Peri Baumeister, Carl Anton Koch, Alexander Scheer, Kais Setti",21062,121
Masters of the Universe: Revelation,2021,"Animation, Action, Adventure",5.0,The war for Eternia begins again in what may be the final battle between He-Man and Skeletor. A new animated series from writer-director Kevin Smith.,"Stars: Chris Wood, Sarah Michelle Gellar, Lena Headey, Mark Hamill",17870,25
The Walking Dead,2010,"Drama, Horror, Thriller",8.2,Sheriff Deputy Rick Grimes wakes up from a coma to learn the world is in ruins and must lead a group of survivors to stay alive.,"Stars: Andrew Lincoln, Norman Reedus, Melissa McBride, Lauren Cohan",885805,44
Rick and Morty,2013,"Animation, Adventure, Comedy",9.2,An animated series that follows the exploits of a super scientist and his not-so-bright grandson.,"Stars: Justin Roiland, Chris Parnell, Spencer Grammer, Sarah Chalke",414849,23
Army of Thieves,2021,"Action, Crime, Horror",7.1,"A prequel, set before the events of Army of the Dead, which focuses on German safecracker Ludwig Dieter leading a group of aspiring thieves on a top secret heist during the early stages of the zombie apocalypse.","Director: Matthias Schweighöfer | Stars: Matthias Schweighöfer, Nathalie Emmanuel, Ruby O. Fee, Stuart Martin",0,60
