# <span style="font-size: 1em">Spark</span><span style="font-size: 0.8em"> Assignment</span>
<h3>Big Data Systems 2022-2023</h3>
<h5>M.Sc. in Business Analytics (Part Time) 2022-2024 at Athens University of Economics and Business (A.U.E.B.)</h5>
<hr>

> Student: Panagiotis G. Vaidomarkakis<br />
> Student I.D.: p2822203<br />
> Tutor: Thanasis Vergoulis<br />
> Due Date: 15/04/2023

## Table Of Contents:
* [Importing Libraries](#first-bullet)
* [$1^{st}$ Question](#q1)
* [$2^{nd}$ Question](#q2)
* [$3^{rd}$ Question](#q3)

## Importing Libraries <a class="anchor" id="first-bullet"></a>
In the following cells, we import all the necessary libraries for the commands in this notebook. <br> First, we check whether the machine running this Jupyter Notebook has the required libraries installed; any that are missing are downloaded automatically so that they can be imported:

In [1]:
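import importlib
import subprocess
import sys

def install_library(lib):
    """Import `lib`; if that fails, pip-install its top-level package."""
    try:
        importlib.import_module(lib)
        print(f'{lib} is already installed.')
    except ImportError:
        print(f'{lib} is not installed. Installing now...')
        # pip installs distributions, not submodules, so use the top-level name
        # (e.g. 'pyspark' for 'pyspark.sql') and the interpreter's own pip
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', lib.split('.')[0]])

libraries = ['pyspark','pyspark.sql','pyspark.sql.functions']

for lib in libraries:
    install_library(lib)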

pyspark is already installed.
pyspark.sql is already installed.
pyspark.sql.functions is already installed.
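
As a lighter-weight alternative to importing each module, the presence of a package can be checked without importing it. The following is a minimal sketch using `importlib.util.find_spec`; the helper name `is_installed` is hypothetical and not part of this notebook.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if the top-level package of `module_name` can be located."""
    # Only the top-level name is checked, so 'pyspark.sql' is looked up as 'pyspark';
    # find_spec returns None when that package cannot be found.
    top_level = module_name.split('.')[0]
    return importlib.util.find_spec(top_level) is not None

# The result depends on the environment, so no expected output is shown
for name in ['json', 'pyspark.sql']:
    print(name, 'available' if is_installed(name) else 'missing')
```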


In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

## $1^{st}$ Question <a class="anchor" id="q1"></a>
Use the *json()* function to load the dataset.<br>
Then return the *<b>“title” & “year”</b>* of the movie with the <b>largest *“users_rating”*</b> whose
title starts with the *<b>first letter</b>* of your *<b>last name</b>*.

In [3]:
# Create a SparkSession
spark = SparkSession.builder.appName('Loading JSON Data').getOrCreate()

In [4]:
# Load the JSON data as a DataFrame using the json() function and create a temporary view movies
json_movie = spark.read.json('movie.json')
json_movie.createOrReplaceTempView("movies")
json_movie.show(5)

In [5]:
# Filter the DataFrame to the movies whose title starts with "V"
filtered_json_movie = json_movie.filter(col("title").startswith("V"))

# Find the maximum users_rating among them, then take one movie that has that rating
max_rating = filtered_json_movie.agg({"users_rating": "max"}).collect()[0][0]
highest_rated_movie = filtered_json_movie.filter(col("users_rating") == max_rating).first()

# Print the title and year of the highest-rated movie
print("Title:", highest_rated_movie.title)
print("Year:", highest_rated_movie.year)

Title: Violet
Year: 2020


The same query expressed in Spark SQL:

In [6]:
# Use Spark SQL to return the title and year of the movie with the largest users_rating whose title starts with 'V'
spark.sql("SELECT title, year FROM movies WHERE title LIKE 'V%' ORDER BY users_rating DESC LIMIT 1").show(n=1, truncate=False, vertical=True)

-RECORD 0-------
 title | Violet 
 year  | 2020   
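
Both `first()` and `LIMIT 1` return a single row, so if several titles share the top rating only one of them is reported. The plain-Python sketch below illustrates this on hypothetical records (the titles and ratings are invented for illustration and do not come from `movie.json`); the helper `top_rated` is likewise an assumption, not part of the notebook.

```python
def top_rated(records, letter):
    """Return every record whose title starts with `letter` and has the maximum rating."""
    candidates = [r for r in records if r["title"].startswith(letter)]
    if not candidates:
        return []
    best = max(r["users_rating"] for r in candidates)
    return [r for r in candidates if r["users_rating"] == best]

# Hypothetical sample, not taken from the dataset
sample = [
    {"title": "Vega", "year": 2001, "users_rating": 8.0},
    {"title": "Vortex", "year": 2010, "users_rating": 8.0},
    {"title": "Atlas", "year": 1999, "users_rating": 9.0},
]
print([r["title"] for r in top_rated(sample, "V")])  # ['Vega', 'Vortex']
```

In Spark, a comparable check would be to call `.collect()` instead of `.first()` on the DataFrame filtered by `max_rating`, which returns all tied rows.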



## $2^{nd}$ Question <a class="anchor" id="q2"></a>
Return the *<b>average “users_rating”</b>* of the movies whose title starts with the *<b>second</b>* letter of your *<b>last name</b>*.

In [7]:
# Filter the DataFrame to movies whose title has 'A' as its second character (substr positions are 1-based)
filtered_json_movie = json_movie.filter(col("title").substr(2, 1) == "A")
average_rating = filtered_json_movie.agg({"users_rating": "avg"}).collect()[0][0]

# print the average users_rating
print("Average users_rating of movies that have second letter 'A' of my last name:", average_rating)

Average users_rating of movies that have second letter 'A' of my last name: 6.56


The same query expressed in Spark SQL:

In [8]:
# Use Spark SQL to compute the average users_rating of movies whose title has 'A' as its second character
result_json_movie = spark.sql("SELECT AVG(users_rating) as avg_rating FROM movies WHERE SUBSTR(title, 2, 1) = 'A'")
average_rating = result_json_movie.collect()[0].avg_rating

# print the average users_rating
print("Average users_rating of movies that have second letter 'A' of my last name:", average_rating)

Average users_rating of movies that have second letter 'A' of my last name: 6.56
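
The question asks for titles that *start with* the letter, while the two cells above compare the character at position 2 of the title (`substr(2, 1)` is 1-based). These two conditions select different titles, as the following plain-Python sketch shows on hypothetical titles (invented for illustration, not taken from `movie.json`). Python indexing is 0-based, so position 2 in Spark corresponds to index 1 here.

```python
titles = ["Avatar", "Batman", "CATS", "Up"]  # hypothetical titles

# Titles that start with 'A' (the condition described in the question text)
starts_with_a = [t for t in titles if t.startswith("A")]

# Titles whose second character is 'A' (what substr(2, 1) == 'A' tests);
# the comparison is case-sensitive, so the lowercase 'a' in 'Batman' does not match
second_char_a = [t for t in titles if t[1:2] == "A"]

print(starts_with_a)  # ['Avatar']
print(second_char_a)  # ['CATS']
```

In the DataFrame API, the prefix condition would be written `col("title").startswith("A")`, as in the first question.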


## $3^{rd}$ Question <a class="anchor" id="q3"></a>
Return the *<b>“title” & “year”</b>* of the movie with the *<b>most votes</b>*, when only movies with title starting with the *<b>third</b>* letter of your *<b>last name</b>* are considered.

In [9]:
# Filter the DataFrame to movies whose title has 'I' as its third character (substr positions are 1-based)
filtered_json_movie = json_movie.filter(col("title").substr(3, 1) == "I")

# get the movie with the most votes
most_voted_movie = filtered_json_movie.orderBy(col("votes").desc()).first()

# print the title and year of the most voted movie
print("Title:", most_voted_movie.title)
print("Year:", most_voted_movie.year)

Title: L.I.E.
Year: 2001


The same query expressed in Spark SQL:

In [10]:
# Use Spark SQL to get the title and year of the most-voted movie among titles whose third character is 'I'
spark.sql("SELECT title, year FROM movies WHERE SUBSTR(title, 3, 1) = 'I' ORDER BY votes DESC LIMIT 1").show(n=1, truncate=False, vertical=True)

-RECORD 0-------
 title | L.I.E. 
 year  | 2001   
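
Ordering by `votes` gives the intended result only if the column is numeric. Whether `votes` is stored as a number or as a string in `movie.json` is not shown in this notebook, so the following is only a sketch under the assumption that it might be a string, in which case values compare lexicographically. The plain-Python example uses hypothetical vote counts.

```python
votes_as_text = ["999", "1000", "25"]  # hypothetical vote counts stored as strings

# Lexicographic comparison: '999' ranks above '1000' because '9' > '1'
print(max(votes_as_text))           # 999

# Numeric comparison after conversion picks the true maximum
print(max(votes_as_text, key=int))  # 1000
```

In Spark, the inferred column types can be inspected with `json_movie.printSchema()`, and a string column can be converted with `col("votes").cast("long")` before ordering.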

