# <span style="font-size: 1em">Spark</span><span style="font-size: 0.8em"> Assignment</span>
<h3>Big Data Systems 2022-2023</h3>
<h5>M.Sc. In Business Analytics (Part Time) 2022-2024 at Athens University of Economics and Business (A.U.E.B.)</h5>
<hr>

> Student: Panagiotis G. Vaidomarkakis<br />
> Student I.D.: p2822203<br />
> Tutor: Thanasis Vergoulis<br />
> Due Date: 15/04/2023

## Table Of Contents:
* [Importing Libraries](#first-bullet)
* [$1^{st}$ Question](#q1)
* [$2^{nd}$ Question](#q2)
* [$3^{rd}$ Question](#q3)
* [$4^{th}$ Question](#q4)
* [$5^{th}$ Question](#q5)
* [$2^{nd}$ way to do all the calculations (Q4 & Q5)](#second-way)

## Importing Libraries <a class="anchor" id="first-bullet"></a>
In the following cells, we import all the necessary libraries for the commands that follow. <br> First, we check whether the machine running this Jupyter Notebook has the required libraries; any that are missing are downloaded automatically so that they can be imported:

In [1]:
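import importlib
import subprocess
import sys

def install_library(lib):
    """Import `lib`; if that fails, install its top-level package with pip."""
    try:
        importlib.import_module(lib)
        print(f'{lib} is already installed.')
    except ImportError:
        print(f'{lib} is not installed. Installing now...')
        # pip installs distributions, not submodules, so use the top-level name
        # (e.g. 'pyspark' for 'pyspark.sql'); sys.executable targets this kernel's environment
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', lib.split('.')[0]])

libraries = ['pyspark','pyspark.sql','pyspark.sql.functions']

for lib in libraries:
    install_library(lib)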

pyspark is already installed.
pyspark.sql is already installed.
pyspark.sql.functions is already installed.
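
The check above tests for a library by importing it. As a lighter alternative, the sketch below uses `importlib.util.find_spec`, which locates a top-level module without executing it. The helper name `is_installed` is hypothetical and not part of the original notebook.

```python
import importlib.util

def is_installed(package: str) -> bool:
    """Return True if the top-level `package` can be located on the import path."""
    # find_spec returns None for a missing top-level module; nothing is imported
    return importlib.util.find_spec(package) is not None

for name in ["json", "pyspark"]:
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Only top-level names are passed here, because looking up a dotted name such as `pyspark.sql` would first import its parent package.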


In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, count, desc

## $1^{st}$ Question <a class="anchor" id="q1"></a>
Use the *json()* function to load the dataset.

In [3]:
# Create a SparkSession
spark = SparkSession.builder.appName('Loading JSON Data').getOrCreate()

If the above doesn't work, download a Java JDK, then uncomment the following cell and execute it. <br>
After that, restart the Python kernel and re-run the notebook up to this point.

In [4]:
# import os

# # Find and use your JAVA_HOME directory. It usually looks something like the path below.
# JAVA_HOME = "C:/Program Files/Eclipse Adoptium/jdk-17.0.6.10-hotspot"
# os.environ["JAVA_HOME"] = JAVA_HOME
# # os.pathsep is ';' on Windows and ':' on Linux/macOS
# os.environ["PATH"] = f"{JAVA_HOME}/bin{os.pathsep}{os.environ['PATH']}"

In [5]:
# Load the JSON data as a DataFrame using the json() function and print the schema to see if everything is fine
json_movie = spark.read.json('movie.json')
json_movie.printSchema()

root
 |-- _corrupt_record: string (nullable = true)
 |-- actors: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- countries: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- description: string (nullable = true)
 |-- directors: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- genre: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- imdb_url: string (nullable = true)
 |-- img_url: string (nullable = true)
 |-- languages: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- metascore: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- runtime: string (nullable = true)
 |-- tagline: string (nullable = true)
 |-- title: string (nullable = true)
 |-- users_rating: string (nullable = true)
 |-- votes: string (nullable = true)
 |-- year: string (nullable = true)



In [6]:
# Print the first 5 rows of the DataFrame to verify that it was loaded correctly
json_movie.show(5)

+---------------+--------------------+---------+--------------------+--------------------+------------------+--------------------+--------------------+-----------------+---------+------+-------+--------------------+--------------------+------------+-------+----+
|_corrupt_record|              actors|countries|         description|           directors|             genre|            imdb_url|             img_url|        languages|metascore|rating|runtime|             tagline|               title|users_rating|  votes|year|
+---------------+--------------------+---------+--------------------+--------------------+------------------+--------------------+--------------------+-----------------+---------+------+-------+--------------------+--------------------+------------+-------+----+
|              [|                null|     null|                null|                null|              null|                null|                null|             null|     null|  null|   null|                n
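
The `_corrupt_record` column in the schema, and the `[` in the first row, suggest that some lines of the file could not be parsed as standalone JSON records. One plausible cause, assumed here rather than confirmed, is that `movie.json` is a single JSON array spread over many lines while the reader treats each line as its own record. The sketch below imitates that situation with Python's standard `json` module on a hypothetical two-movie sample; it is a rough stand-in, not Spark's actual parser.

```python
import json

# Hypothetical miniature of a pretty-printed JSON array
# (an assumption about the layout of movie.json, not its real content)
sample = """[
{"title": "A", "genre": ["Comedy"]},
{"title": "B", "genre": ["Drama"]}
]"""

decoder = json.JSONDecoder()
parsed, corrupt = [], []
for line in sample.splitlines():
    text = line.strip()
    try:
        # raw_decode reads one JSON value and tolerates trailing text such as ','
        record, _ = decoder.raw_decode(text)
        parsed.append(record)
    except json.JSONDecodeError:
        # The bracket-only lines are not valid JSON values on their own
        corrupt.append(text)

print(len(parsed), "parsed;", "corrupt lines:", corrupt)

# Parsing the whole text as one document yields only the real records
whole = json.loads(sample)
print(len(whole), "records when parsed as a single document")
```

Rows flagged this way are still rows of the DataFrame, so any later `count()` includes them. Spark's JSON reader offers a `multiLine` option for files in which one JSON document spans several lines; whether it fits `movie.json` depends on the file's actual layout.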

## $2^{nd}$ Question <a class="anchor" id="q2"></a>
Count and display the <b>number of movies</b> in the database.

In [7]:
# Count the total number of movies
total_movies = json_movie.count()
total_movies

62060

## $3^{rd}$ Question <a class="anchor" id="q3"></a>

Count and display the <b>number of comedies</b> in the database.

In [8]:
# Count the number of comedies
comedy_count = json_movie.filter("array_contains(genre, 'Comedy')").count()
comedy_count

19312

## $4^{th}$ Question <a class="anchor" id="q4"></a>

Use the *summary()* command to display basic statistics about the *<b>“users_rating”</b>* field.

In [9]:
# Use the summary() function to display basic statistics about the "users_rating" field
json_movie.select("users_rating").summary().show()

+-------+------------------+
|summary|      users_rating|
+-------+------------------+
|  count|             62056|
|   mean| 5.814105001933733|
| stddev|1.3521864101722219|
|    min|               1.0|
|    25%|               5.0|
|    50%|               6.0|
|    75%|               6.7|
|    max|               9.9|
+-------+------------------+
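
The schema printed earlier lists `users_rating` as a string. Statistics such as the mean need a numeric cast, but minimum and maximum on a string column are, to my understanding, compared as text. The plain-Python sketch below uses hypothetical values (not taken from `movie.json`) to show how text and numeric comparison can disagree.

```python
# Hypothetical rating values stored as text
ratings = ["9.9", "10.0", "5.5"]

text_max = max(ratings)                        # compares character by character
numeric_max = max(float(r) for r in ratings)   # compares as numbers

print(text_max)     # 9.9
print(numeric_max)  # 10.0
```

If this matters for the dataset, casting the column first, for example with `col("users_rating").cast("double")` (where `col` comes from `pyspark.sql.functions`), makes every statistic numeric.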



## $5^{th}$ Question <a class="anchor" id="q5"></a>

Use the *groupby()* and *count()* commands to display <b>all distinct values</b> in the *<b>“rating”</b>* field and their <b>number of appearances</b>.

In [10]:
# Use groupby() and count() to display all distinct values in the "rating" field and their number of appearances
rating_counts = json_movie.groupBy("rating").agg(count("*").alias("count")).orderBy(desc("count"))
rating_counts.show(31,False)

+---------+-----+
|rating   |count|
+---------+-----+
|null     |20669|
|R        |11368|
|Not Rated|8080 |
|Approved |6419 |
|Passed   |4488 |
|PG-13    |3771 |
|PG       |3286 |
|Unrated  |1295 |
|G        |801  |
|TV-MA    |639  |
|TV-14    |452  |
|TV-PG    |268  |
|X        |152  |
|TV-G     |115  |
|GP       |105  |
|M        |41   |
|M/PG     |27   |
|NC-17    |22   |
|TV-Y     |16   |
|TV-Y7    |14   |
|UA       |7    |
|A        |6    |
|Open     |4    |
|U        |4    |
|C        |3    |
|AO       |2    |
|E        |2    |
|18       |1    |
|(Banned) |1    |
|TV-13    |1    |
|All      |1    |
+---------+-----+
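
As a plain-Python analogy of what the grouping, counting, and descending sort compute, the sketch below counts a small hypothetical list of ratings with `collections.Counter`. `None` stands in for a missing rating and forms its own group, much like the `null` row above.

```python
from collections import Counter

# Hypothetical sample of "rating" values; None stands in for a missing rating
ratings = ["R", None, "PG", "R", None, None, "PG-13"]

counts = Counter(ratings)

# most_common() sorts by count in descending order, like orderBy(desc("count"))
for value, n in counts.most_common():
    print(value, n)
```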



## $2^{nd}$ way to do all the calculations (Q4 & Q5)<a class="anchor" id="second-way"></a>

In [11]:
# Create a temporary view of the DataFrame
json_movie.createOrReplaceTempView("movies")

In [12]:
# Use SQL to display basic statistics about the "users_rating" field
spark.sql("""
    SELECT COUNT(users_rating) AS count,
           AVG(users_rating) AS mean,
           STDDEV(users_rating) AS stddev,
           MIN(users_rating) AS min,
           percentile(users_rating, 0.25) AS `25%`,
           percentile(users_rating, 0.5) AS `50%`,
           percentile(users_rating, 0.75) AS `75%`,
           MAX(users_rating) AS max
    FROM movies
""").show(n=10, truncate=False, vertical=True)

-RECORD 0--------------------
 count  | 62056              
 mean   | 5.814105001933733  
 stddev | 1.3521864101722219 
 min    | 1.0                
 25%    | 5.0                
 50%    | 6.0                
 75%    | 6.7                
 max    | 9.9                
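
The SQL version uses the exact `percentile` function, whereas `summary()` is documented as reporting approximate percentiles; on this data the two agree. To my understanding, the exact variant interpolates linearly at position $p \cdot (n - 1)$ in the sorted values. The sketch below reproduces that rule with Python's `statistics.quantiles` on hypothetical numbers, not on the movie data.

```python
import statistics

# Hypothetical values, chosen so that the quartiles fall between data points
values = [2.0, 4.0, 6.0, 8.0]

# method="inclusive" interpolates linearly at position p * (n - 1)
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
print(q1, median, q3)  # 3.5 5.0 6.5
```

For example, the first quartile sits at position $0.25 \cdot 3 = 0.75$, three quarters of the way from 2.0 to 4.0, which gives 3.5.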



In [13]:
# Use SQL to display all distinct values in the "rating" field and their number of appearances
spark.sql("SELECT rating, COUNT(*) as count FROM movies GROUP BY rating ORDER BY count DESC").show(31,False)

+---------+-----+
|rating   |count|
+---------+-----+
|null     |20669|
|R        |11368|
|Not Rated|8080 |
|Approved |6419 |
|Passed   |4488 |
|PG-13    |3771 |
|PG       |3286 |
|Unrated  |1295 |
|G        |801  |
|TV-MA    |639  |
|TV-14    |452  |
|TV-PG    |268  |
|X        |152  |
|TV-G     |115  |
|GP       |105  |
|M        |41   |
|M/PG     |27   |
|NC-17    |22   |
|TV-Y     |16   |
|TV-Y7    |14   |
|UA       |7    |
|A        |6    |
|Open     |4    |
|U        |4    |
|C        |3    |
|AO       |2    |
|E        |2    |
|18       |1    |
|(Banned) |1    |
|TV-13    |1    |
|All      |1    |
+---------+-----+

