# Readme

Before running the notebook, upload the files below to the cluster's `FileStore/tables` directory (`dbfs:/FileStore/tables/`). All files should already be compressed with `gzip`.

* All files from the [IMDb Non-Commercial Datasets](https://datasets.imdbws.com/) in `.gz` format
    * `title.akas.tsv.gz`
    * `title.ratings.tsv.gz`
    * `title.principals.tsv.gz`
    * `title.episode.tsv.gz`
    * `title.crew.tsv.gz`
    * `title.basics.tsv.gz`
    * `name.basics.tsv.gz`
* All files from the [Kaggle Anime Dataset](https://www.kaggle.com/datasets/dbdmobile/myanimelist-dataset)
    * `anime_dataset_2023.csv.gz`
        > This dataset contains comprehensive details of 24,905 anime entries.
    * `anime_filtered.csv.gz`
        > This dataset provides information about the different attributes and characteristics of each anime (based on 2020 data).
    * `final_animedataset.csv.gz`
        * Note: this file must be compressed with gzip and uploaded on its own.
        > This dataset contains user ratings and information about various anime titles. It is curated for building an anime recommendation system (based on 2018 data).
    * `user_filtered.csv.gz`
        > This dataset contains the user's ratings for every anime they watched and rated (based on 2020 data).
    * `users_details_2023.csv.gz`
        > This dataset comprises information on 731,290 users registered on the MyAnimeList platform. It is worth noting that while a significant portion of these users are genuine anime enthusiasts, there may be instances of bots, inactive accounts, and alternate profiles present within the dataset.
    * `users_score_2023.csv.gz`
        > This dataset comprises anime scores provided by 270,033 users, resulting in a total of 24,325,191 rows or samples.
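
For a file that is not yet compressed (such as `final_animedataset.csv`, per the note above), the following is a minimal sketch using only Python's standard library. `gzip_file` is a hypothetical helper name, and the local path in the usage line is an assumption.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src: str) -> Path:
    """Compress `src` to `<src>.gz` next to it and return the new path."""
    src_path = Path(src)
    dst_path = src_path.with_name(src_path.name + ".gz")
    # Stream in chunks so a large CSV is never loaded fully into memory.
    with open(src_path, "rb") as f_in, gzip.open(dst_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst_path
```

Calling `gzip_file("final_animedataset.csv")` would write `final_animedataset.csv.gz` to the same directory, ready to upload.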

# Import data

In [0]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

In [0]:
files = dbutils.fs.ls("dbfs:/FileStore/tables/")
for file in files:
    print(file.path)

dbfs:/FileStore/tables/anime_dataset_2023_csv.gz
dbfs:/FileStore/tables/anime_filtered_csv.gz
dbfs:/FileStore/tables/final_animedataset_csv.gz
dbfs:/FileStore/tables/name_basics_tsv.gz
dbfs:/FileStore/tables/title_akas.tsv
dbfs:/FileStore/tables/title_akas_tsv.gz
dbfs:/FileStore/tables/title_basics_tsv.gz
dbfs:/FileStore/tables/title_crew_tsv.gz
dbfs:/FileStore/tables/title_episode_tsv.gz
dbfs:/FileStore/tables/title_principals_tsv.gz
dbfs:/FileStore/tables/title_ratings_tsv.gz
dbfs:/FileStore/tables/user_filtered_csv.gz
dbfs:/FileStore/tables/users_details_2023_csv.gz
dbfs:/FileStore/tables/users_score_2023_csv.gz


## Import Kaggle

In [0]:
anime_dataset = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .option("multiLine", "true") \
    .csv("dbfs:/FileStore/tables/anime_dataset_2023_csv.gz")

anime_filtered = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .option("multiLine", "true") \
    .csv("dbfs:/FileStore/tables/anime_filtered_csv.gz")

final_animedataset = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .option("multiLine", "true") \
    .csv("dbfs:/FileStore/tables/final_animedataset_csv.gz")

user_filtered = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .option("multiLine", "true") \
    .csv("dbfs:/FileStore/tables/user_filtered_csv.gz")

users_details_2023 = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .option("multiLine", "true") \
    .csv("dbfs:/FileStore/tables/users_details_2023_csv.gz")

users_score_2023 = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .option("multiLine", "true") \
    .csv("dbfs:/FileStore/tables/users_score_2023_csv.gz")

# anime_dataset_2023.csv.gz
# anime_filtered.csv.gz
# final_animedataset.csv.gz
# user_filtered.csv.gz
# users_details_2023.csv.gz
# users_score_2023.csv.gz

## Import IMDB

In [0]:
# Every IMDB file is tab-separated with a header row, so one reader covers them all.
def read_imdb(file_name):
    return spark.read.option("header", "true") \
        .option("delimiter", "\t") \
        .option("inferSchema", "true") \
        .csv(f"dbfs:/FileStore/tables/{file_name}")

imdb_title = read_imdb("title_akas_tsv.gz")
imdb_ratings = read_imdb("title_ratings_tsv.gz")
imdb_principals = read_imdb("title_principals_tsv.gz")
imdb_episode = read_imdb("title_episode_tsv.gz")
imdb_crew = read_imdb("title_crew_tsv.gz")
imdb_title_basics = read_imdb("title_basics_tsv.gz")
imdb_name_basics = read_imdb("name_basics_tsv.gz")


# Inspect data

## IMDB Data

In [0]:
imdb_title.printSchema()
imdb_title.count()
imdb_title.show(n=5)

root
 |-- titleId: string (nullable = true)
 |-- ordering: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- region: string (nullable = true)
 |-- language: string (nullable = true)
 |-- types: string (nullable = true)
 |-- attributes: string (nullable = true)
 |-- isOriginalTitle: string (nullable = true)

+---------+--------+--------------------+------+--------+-----------+-------------+---------------+
|  titleId|ordering|               title|region|language|      types|   attributes|isOriginalTitle|
+---------+--------+--------------------+------+--------+-----------+-------------+---------------+
|tt0000001|       1|          Карменсіта|    UA|      \N|imdbDisplay|           \N|              0|
|tt0000001|       2|          Carmencita|    DE|      \N|         \N|literal title|              0|
|tt0000001|       3|Carmencita - span...|    HU|      \N|imdbDisplay|           \N|              0|
|tt0000001|       4|          Καρμενσίτα|    GR|      \N|imdbDisplay|    

In [0]:
imdb_title.filter(F.lower(F.col("title")) == 'cowboy bebop').show()

+---------+--------+------------+------+--------+-----------+----------+---------------+
|  titleId|ordering|       title|region|language|      types|attributes|isOriginalTitle|
+---------+--------+------------+------+--------+-----------+----------+---------------+
|tt0213338|      10|Cowboy Bebop|    AU|      \N|imdbDisplay|        \N|              0|
|tt0213338|      12|Cowboy Bebop|    IL|      en|imdbDisplay|        \N|              0|
|tt0213338|      14|Cowboy Bebop|    FI|      \N|imdbDisplay|        \N|              0|
|tt0213338|      15|Cowboy Bebop|    ID|      en|imdbDisplay|        \N|              0|
|tt0213338|      17|Cowboy Bebop|    DE|      \N|imdbDisplay|        \N|              0|
|tt0213338|      19|Cowboy Bebop|    EC|      \N|imdbDisplay|        \N|              0|
|tt0213338|       1|Cowboy Bebop|    IT|      \N|imdbDisplay|        \N|              0|
|tt0213338|      21|Cowboy Bebop|    US|      \N|imdbDisplay|        \N|              0|
|tt0213338|      22|C

In [0]:
imdb_title.filter(F.col("titleId") == "tt0213338").show()

+---------+--------+--------------------+------+--------+-----------+----------+---------------+
|  titleId|ordering|               title|region|language|      types|attributes|isOriginalTitle|
+---------+--------+--------------------+------+--------+-----------+----------+---------------+
|tt0213338|      10|        Cowboy Bebop|    AU|      \N|imdbDisplay|        \N|              0|
|tt0213338|      11|      Kaubôi bibappu|    AE|      \N|imdbDisplay|        \N|              0|
|tt0213338|      12|        Cowboy Bebop|    IL|      en|imdbDisplay|        \N|              0|
|tt0213338|      13|        Ковбой Бибоп|    RU|      \N|imdbDisplay|        \N|              0|
|tt0213338|      14|        Cowboy Bebop|    FI|      \N|imdbDisplay|        \N|              0|
|tt0213338|      15|        Cowboy Bebop|    ID|      en|imdbDisplay|        \N|              0|
|tt0213338|      16|        Kowboj Bebop|    PL|      \N|imdbDisplay|        \N|              0|
|tt0213338|      17|        Co

In [0]:
imdb_title.filter((F.col("titleId") == "tt0213338") & (F.col("region") == "US")).show()

+---------+--------+------------+------+--------+-----------+----------+---------------+
|  titleId|ordering|       title|region|language|      types|attributes|isOriginalTitle|
+---------+--------+------------+------+--------+-----------+----------+---------------+
|tt0213338|      21|Cowboy Bebop|    US|      \N|imdbDisplay|        \N|              0|
+---------+--------+------------+------+--------+-----------+----------+---------------+



In [0]:
imdb_title_basics.filter(F.col("tconst") == "tt0213338").show(truncate=False)

+---------+---------+------------+----------------------------+-------+---------+-------+--------------+--------------------------+
|tconst   |titleType|primaryTitle|originalTitle               |isAdult|startYear|endYear|runtimeMinutes|genres                    |
+---------+---------+------------+----------------------------+-------+---------+-------+--------------+--------------------------+
|tt0213338|tvSeries |Cowboy Bebop|Kaubôi bibappu: Cowboy Bebop|0      |1998     |1999   |650           |Action,Adventure,Animation|
+---------+---------+------------+----------------------------+-------+---------+-------+--------------+--------------------------+



In [0]:
imdb_ratings.filter(F.col("tconst") == "tt0213338").show(truncate=False)

+---------+-------------+--------+
|tconst   |averageRating|numVotes|
+---------+-------------+--------+
|tt0213338|8.9          |136589  |
+---------+-------------+--------+



In [0]:
imdb_episode.filter(F.col("parentTconst") == "tt0213338").show(truncate=False)

+---------+------------+------------+-------------+
|tconst   |parentTconst|seasonNumber|episodeNumber|
+---------+------------+------------+-------------+
|tt0618963|tt0213338   |1           |1            |
|tt0618964|tt0213338   |1           |5            |
|tt0618965|tt0213338   |1           |23           |
|tt0618966|tt0213338   |1           |22           |
|tt0618967|tt0213338   |1           |10           |
|tt0618968|tt0213338   |1           |4            |
|tt0618969|tt0213338   |1           |24           |
|tt0618970|tt0213338   |1           |7            |
|tt0618971|tt0213338   |1           |3            |
|tt0618972|tt0213338   |1           |9            |
|tt0618973|tt0213338   |1           |12           |
|tt0618974|tt0213338   |1           |17           |
|tt0618975|tt0213338   |1           |15           |
|tt0618976|tt0213338   |1           |20           |
|tt0618977|tt0213338   |1           |18           |
|tt0618978|tt0213338   |1           |2            |
|tt0618979|t

## Anime dataset

In [0]:
anime_dataset.count()
anime_dataset.printSchema()
anime_dataset.show(n=5, truncate=False)
anime_dataset.filter(F.col('anime_id') == '1').collect()

root
 |-- anime_id: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- English name: string (nullable = true)
 |-- Other name: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- Synopsis: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Episodes: string (nullable = true)
 |-- Aired: string (nullable = true)
 |-- Premiered: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Producers: string (nullable = true)
 |-- Licensors: string (nullable = true)
 |-- Studios: string (nullable = true)
 |-- Source: string (nullable = true)
 |-- Duration: string (nullable = true)
 |-- Rating: string (nullable = true)
 |-- Rank: string (nullable = true)
 |-- Popularity: string (nullable = true)
 |-- Favorites: string (nullable = true)
 |-- Scored By: string (nullable = true)
 |-- Members: string (nullable = true)
 |-- Image URL: string (nullable = true)

+-----------------------------------------------

In [0]:
anime_filtered.printSchema()
anime_filtered.count()
anime_filtered.show(n=5)
anime_filtered.filter(F.col('anime_id') == 1).collect()

root
 |-- anime_id: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- English name: string (nullable = true)
 |-- Japanese name: string (nullable = true)
 |-- sypnopsis: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Episodes: string (nullable = true)
 |-- Aired: string (nullable = true)
 |-- Premiered: string (nullable = true)
 |-- Producers: string (nullable = true)
 |-- Licensors: string (nullable = true)
 |-- Studios: string (nullable = true)
 |-- Source: string (nullable = true)
 |-- Duration: string (nullable = true)
 |-- Rating: string (nullable = true)
 |-- Ranked: string (nullable = true)
 |-- Popularity: string (nullable = true)
 |-- Members: string (nullable = true)
 |-- Favorites: string (nullable = true)
 |-- Watching: string (nullable = true)
 |-- Completed: string (nullable = true)
 |-- On-Hold: string (nullable = true)
 |-- Dropped: string (nullable = true)

+-

In [0]:

final_animedataset.printSchema()
final_animedataset.count()
final_animedataset.show(n=5)
final_animedataset.filter(F.col('anime_id') == 1)

root
 |-- username: string (nullable = true)
 |-- anime_id: integer (nullable = true)
 |-- my_score: integer (nullable = true)
 |-- user_id: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- title: string (nullable = true)
 |-- type: string (nullable = true)
 |-- source: string (nullable = true)
 |-- score: string (nullable = true)
 |-- scored_by: double (nullable = true)
 |-- rank: double (nullable = true)
 |-- popularity: double (nullable = true)
 |-- genre: string (nullable = true)

+--------+--------+--------+-------+------+-------------+----+------+-----+---------+------+----------+--------------------+
|username|anime_id|my_score|user_id|gender|        title|type|source|score|scored_by|  rank|popularity|               genre|
+--------+--------+--------+-------+------+-------------+----+------+-----+---------+------+----------+--------------------+
|karthiga|      21|       9|2255153|Female|    One Piece|  TV| Manga| 8.54| 423868.0|  91.0|      35.0|Action, Adve

In [0]:
user_filtered.printSchema()
user_filtered.count()
user_filtered.show(n=5)

root
 |-- user_id: integer (nullable = true)
 |-- anime_id: integer (nullable = true)
 |-- rating: integer (nullable = true)

+-------+--------+------+
|user_id|anime_id|rating|
+-------+--------+------+
|      0|      67|     9|
|      0|    6702|     7|
|      0|     242|    10|
|      0|    4898|     0|
|      0|      21|    10|
+-------+--------+------+
only showing top 5 rows



In [0]:
users_details_2023.printSchema()
users_details_2023.count()
users_details_2023.show(n=5)

root
 |-- Mal ID: integer (nullable = true)
 |-- Username: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Birthday: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Joined: string (nullable = true)
 |-- Days Watched: string (nullable = true)
 |-- Mean Score: string (nullable = true)
 |-- Watching: double (nullable = true)
 |-- Completed: double (nullable = true)
 |-- On Hold: double (nullable = true)
 |-- Dropped: double (nullable = true)
 |-- Plan to Watch: double (nullable = true)
 |-- Total Entries: double (nullable = true)
 |-- Rewatched: double (nullable = true)
 |-- Episodes Watched: double (nullable = true)

+------+--------+------+-------------------+--------------------+--------------------+------------+----------+--------+---------+-------+-------+-------------+-------------+---------+----------------+
|Mal ID|Username|Gender|           Birthday|            Location|              Joined|Days Watched|Mean Score|Watching|Completed|On

In [0]:
users_score_2023.printSchema()
users_score_2023.count()
users_score_2023.show(n=5)

root
 |-- user_id: integer (nullable = true)
 |-- Username: string (nullable = true)
 |-- anime_id: integer (nullable = true)
 |-- Anime Title: string (nullable = true)
 |-- rating: string (nullable = true)

+-------+--------+--------+--------------------+------+
|user_id|Username|anime_id|         Anime Title|rating|
+-------+--------+--------+--------------------+------+
|      1|   Xinil|      21|           One Piece|     9|
|      1|   Xinil|      48|         .hack//Sign|     7|
|      1|   Xinil|     320|              A Kite|     5|
|      1|   Xinil|      49|    Aa! Megami-sama!|     8|
|      1|   Xinil|     304|Aa! Megami-sama! ...|     8|
+-------+--------+--------+--------------------+------+
only showing top 5 rows



# Join data

In [0]:
imdb_title_us = imdb_title.filter(F.col("region") == "US")

# Join the US-only imdb_title rows with anime_dataset.
# The join key is anime_dataset["English name"] matched against imdb_title_us["title"].
joined_df = anime_dataset.join(imdb_title_us, anime_dataset["English name"] == imdb_title_us["title"], how="inner")

filtered_df = joined_df.filter(F.col('title') == 'Sleeping Beauty')
row_count = filtered_df.count()
print(row_count)

96
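
A count of 96 for a single title suggests the join key is not unique on one or both sides. An inner equi-join emits one row per matching pair, so a key that appears m times on the left and n times on the right produces m × n rows. The sketch below shows that arithmetic in plain Python. The counts in it are hypothetical (the per-side counts behind the 96 are not shown in this notebook), and `inner_join_row_counts` is an illustrative helper, not a PySpark function.

```python
from collections import Counter


def inner_join_row_counts(left_keys, right_keys):
    """Rows an inner equi-join emits per key: left count * right count."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    return {key: left[key] * right[key] for key in left if key in right}


# Hypothetical counts, chosen only to illustrate the multiplication.
anime_names = ["Sleeping Beauty"] * 2 + ["Cowboy Bebop"]
imdb_us_titles = ["Sleeping Beauty"] * 3 + ["Cowboy Bebop"]
print(inner_join_row_counts(anime_names, imdb_us_titles))
# {'Sleeping Beauty': 6, 'Cowboy Bebop': 1}
```

If one row per title is wanted, one option is to deduplicate the key on each side before joining, for example with `dropDuplicates` on the key column. Whether that suits this analysis depends on which of the duplicate rows should be kept.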
