# Spark Preparation
We first check whether the notebook is running in Google Colab. If it is, we install all the necessary packages.

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 3.3.2 with Hadoop 3.3, Java 8, and findspark, which locates the Spark installation on the system. The installation can be carried out from within the Colab notebook itself.
Learn more from [A Must-Read Guide on How to Work with PySpark on Google Colab for Data Scientists!](https://www.analyticsvidhya.com/blog/2020/11/a-must-read-guide-on-how-to-work-with-pyspark-on-google-colab-for-data-scientists/)

In [1]:
try:
  import google.colab  # only importable inside Google Colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

In [8]:
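import os

if IN_COLAB:
    # Install Java 8, download and unpack Spark 3.3.2, and install findspark.
    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
    # If this download fails, older Spark releases are kept at https://archive.apache.org/dist/spark/
    !wget -q https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
    !tar xf spark-3.3.2-bin-hadoop3.tgz
    !mv spark-3.3.2-bin-hadoop3 spark
    !pip install -q findspark
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
    os.environ["SPARK_HOME"] = "/content/spark"
else:
    # Local setup (here: Java 17 installed via Homebrew on macOS); adjust both paths to your machine.
    os.environ["JAVA_HOME"] = "/opt/homebrew/opt/openjdk@17"
    os.environ["SPARK_HOME"] = "./spark"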
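
Before calling `findspark.init()`, it can help to confirm that both directories actually exist, since a wrong `JAVA_HOME` or `SPARK_HOME` tends to surface later as a less obvious error. The following is a minimal sketch; the helper name `missing_env_dirs` is hypothetical and is not part of findspark or Spark.

```python
import os

def missing_env_dirs(names=("JAVA_HOME", "SPARK_HOME"), env=None):
    """Return the variables that are unset or do not point to an existing directory."""
    env = os.environ if env is None else env
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

# Report problems up front instead of failing later inside findspark.init()
problems = missing_env_dirs()
if problems:
    print("Check these variables:", ", ".join(problems))
```

An empty list means both variables point to existing directories; it does not verify that the directories contain a working Java or Spark installation.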



# Start a Local Cluster

In [9]:
import findspark
findspark.init()

spark_url = 'local'

In [10]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master(spark_url)\
        .appName('Spark Tutorial')\
        .config('spark.ui.port', '4040')\
        .getOrCreate()

25/03/25 21:28:22 WARN Utils: Your hostname, Idhibhats-MacBook-Air.local resolves to a loopback address: 127.0.0.1; using 192.168.1.137 instead (on interface en0)
25/03/25 21:28:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/25 21:28:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Spark Assignment

Based on the movie review dataset in 'netflix-rotten-tomatoes-metacritic-imdb.csv', answer the questions below.

**Note:** do not clean or remove missing data

In [17]:
df_imdb = spark.read.csv("netflix-rotten-tomatoes-metacritic-imdb.csv", header=True, inferSchema=True)

cols = [c.replace(' ', '_') for c in df_imdb.columns]
df_imdb = df_imdb.toDF(*cols)
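
Replacing spaces with underscores lets the columns be referenced later without backtick quoting. The renaming step can be sketched in plain Python as below; the helper `normalize_columns` and the sample names are illustrative assumptions, and the duplicate check is an extra safeguard that the cell above does not perform.

```python
def normalize_columns(columns):
    """Replace spaces with underscores and refuse results that would collide."""
    renamed = [c.replace(" ", "_") for c in columns]
    if len(set(renamed)) != len(renamed):
        raise ValueError("renaming produced duplicate column names")
    return renamed

# Illustrative input only, not read from the CSV file
print(normalize_columns(["Title", "Hidden Gem Score", "Series or Movie"]))
```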

In [18]:
df_imdb.printSchema()

root
 |-- Title: string (nullable = true)
 |-- Genre: string (nullable = true)
 |-- Tags: string (nullable = true)
 |-- Languages: string (nullable = true)
 |-- Series_or_Movie: string (nullable = true)
 |-- Hidden_Gem_Score: double (nullable = true)
 |-- Country_Availability: string (nullable = true)
 |-- Runtime: string (nullable = true)
 |-- Director: string (nullable = true)
 |-- Writer: string (nullable = true)
 |-- Actors: string (nullable = true)
 |-- View_Rating: string (nullable = true)
 |-- IMDb_Score: string (nullable = true)
 |-- Rotten_Tomatoes_Score: string (nullable = true)
 |-- Metacritic_Score: string (nullable = true)
 |-- Awards_Received: double (nullable = true)
 |-- Awards_Nominated_For: double (nullable = true)
 |-- Boxoffice: string (nullable = true)
 |-- Release_Date: string (nullable = true)
 |-- Netflix_Release_Date: string (nullable = true)
 |-- Production_House: string (nullable = true)
 |-- Netflix_Link: string (nullable = true)
 |-- IMDb_Link: string (null

In [25]:
df_imdb.show(5)

+-------------------+--------------------+--------------------+----------------+---------------+----------------+--------------------+------------+---------------+--------------------+--------------------+-----------+----------+---------------------+----------------+---------------+--------------------+----------+------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+--------------------+------------+
|              Title|               Genre|                Tags|       Languages|Series_or_Movie|Hidden_Gem_Score|Country_Availability|     Runtime|       Director|              Writer|              Actors|View_Rating|IMDb_Score|Rotten_Tomatoes_Score|Metacritic_Score|Awards_Received|Awards_Nominated_For| Boxoffice|Release_Date|Netflix_Release_Date|    Production_House|        Netflix_Link|           IMDb_Link|             Summary|IMDb_Votes|               Image|              

## What are the maximum and average of the overall hidden gem score?

In [24]:
# Alias Spark's max so that it does not shadow Python's built-in max()
from pyspark.sql.functions import avg, max as spark_max
df_imdb.select(spark_max("Hidden_Gem_Score"), avg("Hidden_Gem_Score")).show()

+---------------------+---------------------+
|max(Hidden_Gem_Score)|avg(Hidden_Gem_Score)|
+---------------------+---------------------+
|                  9.8|    5.937551386501226|
+---------------------+---------------------+

## How many movies are available in Korea?

In [None]:
df_imdb.filter(df_imdb["Country_Availability"].like("%Korea%")).count()

4845

In [26]:
df_imdb.filter(df_imdb["Country_Availability"].contains("Korea")).count()

4845
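
Both cells use substring matching, so any value that contains "Korea" (such as "South Korea") is counted, and neither cell filters on `Series_or_Movie`, so series are included in the count. The plain-Python sketch below illustrates these semantics on made-up rows. The sample data, the helper names, and the assumption that the column uses the value "Movie" are all hypothetical, not taken from the dataset.

```python
rows = [
    {"Title": "A", "Series_or_Movie": "Movie", "Country_Availability": "Japan,South Korea"},
    {"Title": "B", "Series_or_Movie": "Series", "Country_Availability": "South Korea"},
    {"Title": "C", "Series_or_Movie": "Movie", "Country_Availability": "Thailand"},
    {"Title": "D", "Series_or_Movie": "Movie", "Country_Availability": None},
]

def available_substring(row, needle="Korea"):
    """Mirror of like('%Korea%') / contains('Korea'); null values never match."""
    countries = row["Country_Availability"]
    return countries is not None and needle in countries

def available_exact(row, country="South Korea"):
    """Stricter variant: match one whole entry of the comma-separated list."""
    countries = row["Country_Availability"]
    if countries is None:
        return False
    return country in [c.strip() for c in countries.split(",")]

all_titles = [r["Title"] for r in rows if available_substring(r)]
movies_only = [r["Title"] for r in rows
               if available_substring(r) and r["Series_or_Movie"] == "Movie"]
exact = [r["Title"] for r in rows if available_exact(r)]
print(all_titles, movies_only, exact)
```

Whether the question intends titles of any type or only rows marked as movies is a matter of interpretation; the counts above follow the first reading.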

## Which director has the highest average hidden gem score?

In [31]:
df_imdb.groupby("Director") \
    .agg(avg("Hidden_Gem_Score").alias("Avg_Hidden_Gem_Score")) \
    .sort("Avg_Hidden_Gem_Score", ascending=False) \
    .show()

+--------------------+--------------------+
|            Director|Avg_Hidden_Gem_Score|
+--------------------+--------------------+
|         Dorin Marcu|                 9.8|
|    Fernando Escovar|                 9.6|
|          Rosa Russo|                 9.5|
|         Kate Brooks|                 9.5|
|Vincent Bal, Kenn...|                 9.5|
|    Ignacio Busquier|                 9.5|
|Bill Butler, Will...|                 9.5|
|     Charles Officer|                 9.4|
|           Ryan Sage|                 9.3|
|   Frederico Machado|                 9.3|
|    Ashish R. Shukla|                 9.3|
|         Lisa France|                 9.3|
|Jacqui Morris, Da...|                 9.3|
|    Jan Philipp Weyl|                 9.3|
|      Aundre Johnson|                 9.3|
|        R.J. Bentler|                 9.3|
|     Rabeah Ghaffari|                 9.3|
|          Oh Jin-Koo|                 9.3|
|        Shinkyu Choi|                 9.3|
|         André Canto|          

## How many genres are there in the dataset?

In [51]:
from pyspark.sql import functions as F

In [53]:
all_genres = df_imdb.select(
    F.explode(
        F.split(F.col("Genre"), ",")
    ).alias("Genre")
)

all_genres.show()

+---------+
|    Genre|
+---------+
|    Crime|
|    Drama|
|  Fantasy|
|   Horror|
|  Romance|
|   Comedy|
|    Drama|
| Thriller|
|    Drama|
|Animation|
|    Short|
|    Drama|
|   Comedy|
|  Romance|
|    Drama|
|    Crime|
|    Drama|
|   Comedy|
|   Comedy|
|   Family|
+---------+
only showing top 20 rows



In [55]:
all_genres.select(F.trim(F.col("Genre"))).distinct().count()

28
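
Trimming matters here because splitting on "," alone leaves a leading space whenever the source separates genres with ", ", so " Drama" and "Drama" would be counted as different genres. The following plain-Python sketch applies the same split, trim, and distinct steps to made-up values; the sample list and the helper name `distinct_genres` are hypothetical.

```python
genre_column = ["Crime, Drama, Fantasy", "Comedy, Drama", None, "Drama"]

def distinct_genres(values, trim=True):
    """Split each value on commas and collect the distinct genre names."""
    genres = set()
    for value in values:
        if value is None:  # explode() likewise yields no rows for a null array
            continue
        for part in value.split(","):
            genres.add(part.strip() if trim else part)
    return genres

print(len(distinct_genres(genre_column, trim=False)))  # " Drama" and "Drama" counted separately
print(len(distinct_genres(genre_column)))              # duplicates collapse after trimming
```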