## Task three

The client requires functionality to allow users to query their dataset. 

The user should be able to use any number of columns, 
alongside a range of values for each column to filter films. 

For example, the user may want to find all films by Quentin Tarantino or George Lucas
that were released between 1979 and 2000. 

The user should be able to use a variety of querying methods where appropriate, such as matching values within a range or matching any value from a specific set.
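
One way to keep the two querying methods (range and set of values) flexible is to describe each criterion as a small object and translate the whole list into a Spark SQL filter string, which `DataFrame.filter` accepts. The sketch below is only an illustration: the class names `Between`, `OneOf`, `ContainsAny` and the helpers `sql_literal` and `build_filter` are hypothetical, and it assumes Spark's default backslash escaping for string literals.

```python
from dataclasses import dataclass
from datetime import date
from typing import Sequence, Union

Value = Union[int, float, str, date]


def sql_literal(value: Value) -> str:
    """Render a Python value as a Spark SQL literal."""
    if isinstance(value, date):
        return f"DATE '{value.isoformat()}'"
    if isinstance(value, str):
        # Assumes Spark's default escaping: backslash before backslashes and quotes.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
    return repr(value)


def sql_list(values: Sequence[Value]) -> str:
    """Render a non-empty sequence as a comma-separated list of literals."""
    if not values:
        raise ValueError("at least one value is required")
    return ", ".join(sql_literal(v) for v in values)


@dataclass(frozen=True)
class Between:
    """Column value lies in the inclusive range [low, high]."""
    column: str
    low: Value
    high: Value

    def to_sql(self) -> str:
        return f"`{self.column}` BETWEEN {sql_literal(self.low)} AND {sql_literal(self.high)}"


@dataclass(frozen=True)
class OneOf:
    """Scalar column equals any of the given values."""
    column: str
    values: Sequence[Value]

    def to_sql(self) -> str:
        return f"`{self.column}` IN ({sql_list(self.values)})"


@dataclass(frozen=True)
class ContainsAny:
    """Array column shares at least one element with the given values."""
    column: str
    values: Sequence[Value]

    def to_sql(self) -> str:
        return f"arrays_overlap(`{self.column}`, array({sql_list(self.values)}))"


def build_filter(criteria, allowed_columns) -> str:
    """AND all criteria together, rejecting column names that are not allowed."""
    for criterion in criteria:
        if criterion.column not in allowed_columns:
            raise ValueError(f"unknown column: {criterion.column}")
    return " AND ".join(f"({c.to_sql()})" for c in criteria) or "TRUE"


# Column names as printed by df_film.printSchema() in this notebook.
COLUMNS = {"film_id", "title", "year", "duration", "genres", "rating", "vote_count", "persons"}

expr = build_filter(
    [
        ContainsAny("persons", ["Quentin Tarantino", "George Lucas"]),
        Between("year", date(1979, 1, 1), date(2000, 12, 31)),
    ],
    COLUMNS,
)
print(expr)
```

The resulting string could then be passed to `df_film.filter(expr)`. Checking column names against an allow-list keeps arbitrary user input out of the identifier position of the expression.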

The dataset that should be used is your output from Task one, and the system you build should be flexible and reusable. 

You must build just the backend for this functionality. You should create some suitable test cases to demonstrate that this functionality works as intended.
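
The test cases could be organised as a table of (name, criteria, expected titles) checked against a tiny hand-made fixture. The sketch below uses a pure-Python reference of the intended filter semantics rather than Spark, so it can serve as an oracle for the real backend. The fixture rows, the `matches` and `query` helpers, and the criteria format are all hypothetical; storing `year` as 1 January of the release year is also an assumption.

```python
from datetime import date

# Hand-made fixture (NOT taken from the Task one output); column names follow df_film.
FILMS = [
    {"title": "Star Wars", "year": date(1977, 1, 1), "persons": ["George Lucas"]},
    {"title": "Home Alone", "year": date(1990, 1, 1), "persons": ["Chris Columbus"]},
    {"title": "Pulp Fiction", "year": date(1994, 1, 1), "persons": ["Quentin Tarantino"]},
    {"title": "Star Wars: Episode I - The Phantom Menace", "year": date(1999, 1, 1),
     "persons": ["George Lucas"]},
    {"title": "Kill Bill: Vol. 1", "year": date(2003, 1, 1), "persons": ["Quentin Tarantino"]},
]


def matches(row, criteria):
    """Return True when the row satisfies every criterion (logical AND)."""
    for column, (kind, arg) in criteria.items():
        value = row[column]
        if kind == "between":
            low, high = arg
            ok = low <= value <= high  # inclusive at both ends
        elif kind == "any_of":
            # Array columns match on any shared element, scalar columns on membership.
            ok = bool(set(value) & set(arg)) if isinstance(value, list) else value in arg
        else:
            raise ValueError(f"unknown query kind: {kind}")
        if not ok:
            return False
    return True


def query(rows, criteria):
    """Return the titles of all rows that match, in fixture order."""
    return [row["title"] for row in rows if matches(row, criteria)]


TEST_CASES = [
    ("directors within a year range",
     {"persons": ("any_of", ["Quentin Tarantino", "George Lucas"]),
      "year": ("between", (date(1979, 1, 1), date(2000, 12, 31)))},
     ["Pulp Fiction", "Star Wars: Episode I - The Phantom Menace"]),
    ("range bounds are inclusive",
     {"year": ("between", (date(1990, 1, 1), date(1994, 1, 1)))},
     ["Home Alone", "Pulp Fiction"]),
    ("set of values on a scalar column",
     {"title": ("any_of", ["Home Alone"])},
     ["Home Alone"]),
    ("no criteria returns every film", {}, [film["title"] for film in FILMS]),
]

for name, criteria, expected in TEST_CASES:
    assert query(FILMS, criteria) == expected, name
```

The same table of cases could be reused against the Spark backend by loading the fixture into a DataFrame and comparing the collected titles with `expected`.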

In [1]:
import pyspark.sql
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import DoubleType

In [2]:
from pyspark.ml.feature import HashingTF, IDF
from pyspark.ml.feature import Normalizer
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

In [3]:
spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "15g") \
    .appName('imdb-munging') \
    .getOrCreate()

sc = spark.sparkContext

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/22 13:34:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/12/22 13:34:20 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [4]:
# load the IMDb films data prepared in previous task
input_path = "../output/films"
df_film = spark.read.parquet(input_path)


In [5]:
df_film.printSchema()

root
 |-- film_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- year: date (nullable = true)
 |-- duration: integer (nullable = true)
 |-- genres: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- rating: decimal(4,2) (nullable = true)
 |-- vote_count: integer (nullable = true)
 |-- persons: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [6]:
# reduce the row count... some filtering
#df_film.sample(withReplacement=False, fraction=0.10, seed=2).show(truncate=False)
#df_film = df_film.filter( df_film.year >= '2020-01-01')\
#  .filter(~(f.array_contains( df_film['genres'], 'Documentary')) )

df_film.count()

39427

search by:
- title
- year (range)
- genres
- persons (any actor, director, producer)
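
For free-text search on `title` and `persons`, the cells further down pass patterns such as `(?i)Keanu` straight to a regex function. If the search text comes from a user, regex metacharacters in it (for example the `.` in "Kill Bill: Vol. 1") would change the meaning of the pattern. A minimal sketch of a helper that builds a literal, case-insensitive pattern follows; `contains_pattern` is a hypothetical name, and it assumes the escapes produced by Python's `re.escape` are also accepted by the Java regex engine that Spark uses.

```python
import re


def contains_pattern(text: str) -> str:
    """Build a case-insensitive regex that matches `text` literally, anywhere in a string."""
    # re.escape neutralises metacharacters such as '.', '(' or '*' in user input.
    return "(?i)" + re.escape(text.strip())


pattern = contains_pattern("home alone")

# Checked here with Python's own regex engine.
assert re.search(pattern, "Home Alone 2: Lost in New York")
assert not re.search(contains_pattern("Vol. 1"), "Volx 1")
```

The returned string could be used wherever the notebook currently writes a raw pattern, for example `df_film['title'].rlike(pattern)`.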


In [7]:
# count films per genre combination (the individual values shown are the valid genres)

df_film.groupBy('genres').count().sort(f.desc('count')).show(truncate=False)

feature_cols = [c for c in df_film.columns if c != 'film_id']
print(feature_cols)


                                                                                

+------------------------------+-----+
|genres                        |count|
+------------------------------+-----+
|[Drama]                       |4829 |
|[Comedy, Drama]               |2153 |
|[Drama, Romance]              |1981 |
|[Comedy]                      |1719 |
|[Comedy, Drama, Romance]      |1542 |
|[Documentary]                 |1469 |
|[Comedy, Romance]             |972  |
|[Action, Crime, Drama]        |742  |
|[Crime, Drama]                |639  |
|[Crime, Drama, Thriller]      |615  |
|[Drama, Thriller]             |576  |
|[Crime, Drama, Mystery]       |505  |
|[Biography, Drama, History]   |418  |
|[Drama, War]                  |403  |
|[Comedy, Crime, Drama]        |376  |
|[Crime, Drama, Film-Noir]     |356  |
|[Action, Adventure, Animation]|331  |
|[Action, Adventure, Drama]    |322  |
|[Action, Comedy, Crime]       |316  |
|[Adventure, Animation, Comedy]|315  |
+------------------------------+-----+
only showing top 20 rows

['title', 'year', 'duration', 'genres'

In [33]:
#df_film.filter(~(f.array_contains( df_film['genres'], 'Horror'))).count()
#df_film.filter((f.array_contains( df_film['genres'], 'Thriller')) & (df_film['year'] >= '2000-01-01')).show(20, truncate=False) 
#df_film.filter((f.exists(df_film['persons'], lambda x: x == 'Arnold')) & (df_film['year'] >= '1980-01-01')).show(20, truncate=False) 

df_film.filter(  f.regexp_count( f.array_join('persons', ','), f.lit(r'(?i)Keanu') ) >= 1 ).show(50, False) 
#df_film.filter(  f.regexp_count( f.array_join('persons', ','), f.lit(r'(?i)John Williams') ) >= 1 ).show(50, False) 

#df_film.filter(  df_film['title'].rlike(r'(?i)home alone') ).show(20, False) 
#df_film.filter(  df_film['title'].rlike(r'(?i)Maze Runner') ).show(20, False) 


+--------+---------------------------------+----------+--------+------------------------------+------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|film_id |title                            |year      |duration|genres                        |rating|vote_count|persons                                                                                                                                                                                                                                                                                                                                                                

In [9]:
df_film.show(20, truncate=False)

+-------+-------------------------------------------------+----------+--------+------------------------------+------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|film_id|title                                            |year      |duration|genres                        |rating|vote_count|persons                                                                                                                                                                                                                                                                                                                            |
+-------+-------------------------------------------------+----------+--------+---------------

In [10]:
df_film.groupBy('genres').count().sort(f.desc('count')).show(truncate=False)


+------------------------------+-----+
|genres                        |count|
+------------------------------+-----+
|[Drama]                       |4829 |
|[Comedy, Drama]               |2153 |
|[Drama, Romance]              |1981 |
|[Comedy]                      |1719 |
|[Comedy, Drama, Romance]      |1542 |
|[Documentary]                 |1469 |
|[Comedy, Romance]             |972  |
|[Action, Crime, Drama]        |742  |
|[Crime, Drama]                |639  |
|[Crime, Drama, Thriller]      |615  |
|[Drama, Thriller]             |576  |
|[Crime, Drama, Mystery]       |505  |
|[Biography, Drama, History]   |418  |
|[Drama, War]                  |403  |
|[Comedy, Crime, Drama]        |376  |
|[Crime, Drama, Film-Noir]     |356  |
|[Action, Adventure, Animation]|331  |
|[Action, Adventure, Drama]    |322  |
|[Action, Comedy, Crime]       |316  |
|[Adventure, Animation, Comedy]|315  |
+------------------------------+-----+
only showing top 20 rows



In [11]:
feature_cols = [c for c in df_film.columns if c != 'film_id']
print(feature_cols)
feature_cols = ['title', 'persons', 'genres', 'year', 'duration']
print(feature_cols)

['title', 'year', 'duration', 'genres', 'rating', 'vote_count', 'persons']
['title', 'persons', 'genres', 'year', 'duration']


## Stage 2

In [13]:
# Load the cosine similarity (matrix dot product) data

# >= 2024 -> 12_590 rows -> 79_247_755 normals (just for this year!)
sc.setLogLevel("WARN")

# load the pairwise film similarity data prepared in the previous task
input_path = "../output/csfilm"
df_cos_sim = spark.read.parquet(input_path)

df_cos_sim.count()

777224451

In [14]:
#df_cos_sim = df_cos_sim.withColumnsRenamed({'i': 'film_id', 'j': 'other_id', 'dot': 'similarity'})
df_cos_sim.printSchema()

root
 |-- film_id: integer (nullable = true)
 |-- other_id: integer (nullable = true)
 |-- similarity: double (nullable = true)
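
The `similarity` column is described above as cosine similarity obtained from a matrix dot product. The code that produced it is not part of this notebook, so the following is only a small pure-Python illustration of why that works, assuming the feature vectors were L2-normalised first: the dot product of two unit-length vectors equals their cosine similarity. The vectors `a` and `b` are made up for the example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up feature vectors standing in for two films.
a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

sim_direct = cosine_similarity(a, b)
sim_normalized = dot(l2_normalize(a), l2_normalize(b))

# Both routes give the same value, up to floating-point rounding.
assert math.isclose(sim_direct, sim_normalized)
```

With non-negative features such as term weights, the value falls between 0 and 1, which is consistent with thresholds like `0.1` or `0.45` used in the calls below.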



In [15]:
# Function to detect similarities between films
from pyspark.sql import DataFrame

def get_similar_films(film_id: int, threshold: float = 0.1) -> DataFrame:
    """Show the given film and return up to 10 films with similarity >= threshold."""
    # show the film being looked up
    df_film.filter(f.col('film_id') == film_id).show(truncate=False)
    df_rec = df_cos_sim.alias('reco')\
            .filter( (f.col('reco.film_id') == film_id) & (f.col('reco.similarity') >= threshold))\
            .join(df_film.alias('films'), f.col('reco.other_id') == f.col('films.film_id'), how='left')\
            .sort(f.desc('reco.similarity'))\
            .limit(10)
    # show() returns None, so display the result separately and return the DataFrame itself
    df_rec.show(truncate=False)

    return df_rec


In [35]:
#get_similar_films(1392170)
#get_similar_films(99785) # Home Alone
#get_similar_films(88763) # "BTTF"
#get_similar_films(990372, 0.45) # "Detective Conan"
get_similar_films(462699) # "Conan the Future Boy"
#get_similar_films(1392170) # "The Hunger Games"
#get_similar_films(99785) # Home Alone
#get_similar_films(11286314) # "Don't Look Up"
#get_similar_films(15410318, 0.1) # "Amy's bucket list"
#get_similar_films(1517268, 0.05) # "Barbie"
#get_similar_films(120915, 0.1) # Star Wars I
#get_similar_films(133093, 0.2) # The Matrix
#get_similar_films(6791350) # Guardians of the Galaxy 3
#get_similar_films(film_id=11286314, threshold=0.1)

#get_similar_films() # 

+-------+--------------------------------------------------------+----------+--------+------------------------------+------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|film_id|title                                                   |year      |duration|genres                        |rating|vote_count|persons                                                                                                                                                                                                                                                                                  |
+-------+--------------------------------------------------------+----------+--------+------------------------------+------+----------+-----------------------------

In [71]:
sc.stop()