# group listing

This notebook works out which metadata we'd like to show for group recommendations.
Given a group, which other groups are similar?
In addition, given a group, which manga have they translated?
And finally, given a group, which similar manga might they translate?

In [38]:
%load_ext autoreload
%autoreload 2
%load_ext lab_black
from pyspark.sql import functions as F, Window
from manga_recsys.spark import get_spark

spark = get_spark()


In [2]:
group = spark.read.parquet("../data/processed/2022-12-17-mangadex-group.parquet")
group.printSchema()

root
 |-- attributes: struct (nullable = true)
 |    |-- altNames: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- en: string (nullable = true)
 |    |-- contactEmail: string (nullable = true)
 |    |-- createdAt: string (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- discord: string (nullable = true)
 |    |-- focusedLanguages: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- inactive: boolean (nullable = true)
 |    |-- ircChannel: string (nullable = true)
 |    |-- ircServer: string (nullable = true)
 |    |-- locked: boolean (nullable = true)
 |    |-- mangaUpdates: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- official: boolean (nullable = true)
 |    |-- publishDelay: string (nullable = true)
 |    |-- twitter: string (nullable = true)
 |    |-- updatedAt: string (nullable = true)
 |    |-- verified: boolean (nullable = true)
 |    |-- v

In [16]:
group_names = group.select(
    F.col("id").alias("group_id"), F.col("attributes.name").alias("group_name")
)
group_names.show(5, False)

+------------------------------------+-------------------+
|group_id                            |group_name         |
+------------------------------------+-------------------+
|c6931ee7-b4cd-44da-a52b-c8d1a90db4d2|LaSecteDuScan      |
|3eef1981-4ab5-434c-a13a-8128351447b7|Alive Scans        |
|145f9110-0a6c-4b71-8737-6acb1a3c5da4|Unknown            |
|7f4ea5d0-6af4-48a4-b56c-c7240668096b|Effortposting Scans|
|c1a3aadb-8b80-4456-93b7-68ba90f819ce|Saikai Scan        |
+------------------------------------+-------------------+
only showing top 5 rows



Next, find all the manga that each group has scanlated.

In [28]:
manga = spark.read.parquet("../data/processed/2022-12-10-mangadex-manga.parquet")

In [43]:
manga_relationships = manga.select(
    "id", F.explode("relationships").alias("relationship")
)
manga_relationships.show(5, False)
manga_relationships.groupBy("relationship.type").count().show()

# which language is most common among manga titles?

manga_name_lang = manga.select(
    F.col("id").alias("manga_id"), F.explode("attributes.title").alias("lang", "name")
)
manga_name_lang.show(5, False)

lang_ordered = (
    manga_name_lang.groupBy("lang")
    .count()
    .orderBy(F.desc("count"))
    .withColumn("rank", F.row_number().over(Window.orderBy(F.desc("count"))))
)
lang_ordered.show(5, False)

# for each manga, keep the title in its most common language (lowest rank number)
manga_name = (
    manga_name_lang.join(lang_ordered, "lang")
    .withColumn(
        "manga_lang_rank",
        F.row_number().over(Window.partitionBy("manga_id").orderBy("rank")),
    )
    .filter(F.col("manga_lang_rank") == 1)
    .select("manga_id", "name", "lang")
)
manga_name.where("lang <> 'en'").show()

+------------------------------------+-------------------------------------------------------+
|id                                  |relationship                                           |
+------------------------------------+-------------------------------------------------------+
|6b64bfb7-8fff-4633-82e2-340cbb8bc92e|{060503b8-a561-4dd8-8607-6524eebb90bf, null, author}   |
|6b64bfb7-8fff-4633-82e2-340cbb8bc92e|{060503b8-a561-4dd8-8607-6524eebb90bf, null, artist}   |
|6b64bfb7-8fff-4633-82e2-340cbb8bc92e|{ebcc4898-8a52-4fb5-b3f9-53983c699fe6, null, cover_art}|
|4660003f-15c9-4b52-84c1-ba46c6943edf|{a49063fb-6a3e-4a4b-a304-dc1a1198afba, null, author}   |
|4660003f-15c9-4b52-84c1-ba46c6943edf|{2a1c83c6-4bfd-41bf-8e76-2e0917ea3a64, null, artist}   |
+------------------------------------+-------------------------------------------------------+
only showing top 5 rows

+---------+-----+
|     type|count|
+---------+-----+
|   artist|80287|
|    manga|25689|
|cover_art|65690|
|   author|8
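
The title-selection step above ranks languages by how often they appear and then keeps, for each manga, the title in its best-ranked language. Here is a minimal plain-Python sketch of the same idea. The toy `titles` data and the `pick_title` helper are hypothetical and only illustrate the logic. Unlike the Spark cell, this sketch breaks ties alphabetically so that the result is deterministic.

```python
from collections import Counter

# Hypothetical toy data: manga_id -> {language code: title}
titles = {
    "m1": {"en": "Title One", "ja": "Title One (ja)"},
    "m2": {"ja-ro": "Taitoru Ni", "ja": "Title Two (ja)"},
    "m3": {"en": "Title Three"},
    "m4": {"ja-ro": "Taitoru Yon"},
    "m5": {"en": "Title Five"},
}

# Count how many manga carry a title in each language.
lang_counts = Counter(lang for by_lang in titles.values() for lang in by_lang)

# Rank languages from most to least common (1 = most common).
# Ties are broken alphabetically (an assumption made for this sketch).
ordered = sorted(lang_counts, key=lambda lang: (-lang_counts[lang], lang))
lang_rank = {lang: i + 1 for i, lang in enumerate(ordered)}


def pick_title(by_lang):
    """Return (lang, title) for the best-ranked language available."""
    best = min(by_lang, key=lang_rank.__getitem__)
    return best, by_lang[best]


manga_name = {manga_id: pick_title(by_lang) for manga_id, by_lang in titles.items()}
```

A manga with an English title keeps it, while one without falls back to its next most common language, which is what the `manga_lang_rank == 1` filter does in the Spark version.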

In [12]:
chapter = spark.read.parquet("../data/processed/2022-12-16-mangadex-chapter.parquet")

In [44]:
chapter_groups = chapter.select(
    F.col("id").alias("chapter_id"),
    F.col("relationships.scanlation_group").alias("group_id"),
    F.col("relationships.manga").alias("manga_id"),
)
chapter_groups.show(5, False)

+------------------------------------+------------------------------------+------------------------------------+
|chapter_id                          |group_id                            |manga_id                            |
+------------------------------------+------------------------------------+------------------------------------+
|0002870f-2597-4b04-84d6-a1f4266f2b9d|c6931ee7-b4cd-44da-a52b-c8d1a90db4d2|44e60bff-ca42-4f5d-9730-b556854a0077|
|0002c873-7652-461e-944e-e544e18424bb|3eef1981-4ab5-434c-a13a-8128351447b7|371a7405-bee1-402c-b0d7-74ea3fb4d587|
|0005d1a2-2492-4752-bd84-fdec525988c8|145f9110-0a6c-4b71-8737-6acb1a3c5da4|350960aa-06ad-428d-ab00-1091a230a70f|
|00075450-2506-436d-b39c-829483d9c536|7f4ea5d0-6af4-48a4-b56c-c7240668096b|c12bded1-f4d1-43e9-8d02-83a62ce78db9|
|0007f58f-40c8-4637-97c6-ada252ac62e5|c1a3aadb-8b80-4456-93b7-68ba90f819ce|b9f8de37-18bc-4932-8793-7313cc3061c1|
+------------------------------------+------------------------------------+---------------------

In [48]:
group_manga = group_names.join(chapter_groups, "group_id").join(manga_name, "manga_id")
group_manga = group_manga.groupBy(
    *[c for c in group_manga.columns if c != "chapter_id"]
).agg(F.countDistinct("chapter_id").alias("chapter_count"))
group_manga.show(3, False, True)

-RECORD 0---------------------------------------------
 manga_id      | af54d2db-ee64-4f77-afd2-90ed23334dc2 
 group_id      | 1cc47109-9777-420c-994c-a497d3c4fbac 
 group_name    | Henka no Kaze                        
 name          | Kare wa, Ano Ko no Mono              
 lang          | en                                   
 chapter_count | 4                                    
-RECORD 1---------------------------------------------
 manga_id      | c914a502-c30a-4324-9003-4207ec32b07c 
 group_id      | 07ec9f2d-7961-4b38-92be-1eb2fca5a461 
 group_name    | Lion's Ridge                         
 name          | Reform with no Wasted Draws          
 lang          | en                                   
 chapter_count | 68                                   
-RECORD 2---------------------------------------------
 manga_id      | 19daf6ef-6d95-46e5-9e1a-f4e5b655902f 
 group_id      | ddd2776a-c49e-41ec-8f01-7fc5a98d21cf 
 group_name    | Eleven Scanlator                     
 name     
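
The aggregation above counts distinct chapters for each (group, manga) pair. As a rough illustration of what that `countDistinct` does, the plain-Python sketch below reproduces it on toy rows. The `chapters` list and the `manga_for_group` helper are hypothetical names that do not appear in the notebook.

```python
from collections import defaultdict

# Hypothetical toy rows: (chapter_id, group_id, manga_id)
chapters = [
    ("c1", "g1", "m1"),
    ("c2", "g1", "m1"),
    ("c2", "g1", "m1"),  # duplicate row; a distinct count sees it once
    ("c3", "g1", "m2"),
    ("c4", "g2", "m1"),
]

# Collect the distinct chapter ids seen for each (group, manga) pair.
chapter_ids = defaultdict(set)
for chapter_id, group_id, manga_id in chapters:
    chapter_ids[(group_id, manga_id)].add(chapter_id)

chapter_count = {pair: len(ids) for pair, ids in chapter_ids.items()}


def manga_for_group(group_id):
    """Return (manga_id, chapter_count) pairs for one group, most chapters first."""
    rows = [(m, n) for (g, m), n in chapter_count.items() if g == group_id]
    return sorted(rows, key=lambda row: (-row[1], row[0]))
```

Calling `manga_for_group("g1")` on the toy data lists `m1` before `m2`, which is one way to answer "which manga has this group translated?" for display.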

In [51]:
group_summary = group_manga.groupBy("group_id", "group_name").agg(
    F.sum("chapter_count").alias("total_chapters"),
    F.countDistinct("manga_id").alias("manga_count"),
)
group_summary.show(5, False)

+------------------------------------+----------------+--------------+-----------+
|group_id                            |group_name      |total_chapters|manga_count|
+------------------------------------+----------------+--------------+-----------+
|aed39467-d5f8-4212-83ca-7e82586324c7|Outerworld Scans|565           |30         |
|a07f801d-599f-47c4-bb7a-df943863b86b|Psylocke Scans  |812           |107        |
|c236e525-e38f-4fa5-86a8-dd23a6771d63|Giant Ethicist  |531           |16         |
|3cba2871-3446-484b-980e-e8e645b55695|Blissful Sin    |470           |95         |
|a72210da-c862-4587-a10b-e302d6a4463a|Kusoshop        |23            |2          |
+------------------------------------+----------------+--------------+-----------+
only showing top 5 rows
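
With a group-to-manga mapping in hand, one possible way to approach the "similar groups" question from the introduction is to compare the sets of manga that two groups have worked on. The sketch below uses Jaccard similarity for this. It is an assumption made for illustration, not a method the notebook commits to, and the toy `group_manga_sets` data and helper names are hypothetical.

```python
# Hypothetical toy data: group_id -> set of manga_ids the group has worked on
group_manga_sets = {
    "g1": {"m1", "m2", "m3"},
    "g2": {"m2", "m3", "m4"},
    "g3": {"m5"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_groups(group_id, top_n=5):
    """Rank the other groups by manga overlap with group_id, highest first."""
    target = group_manga_sets[group_id]
    scores = [
        (other, jaccard(target, manga))
        for other, manga in group_manga_sets.items()
        if other != group_id
    ]
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:top_n]
```

Set overlap ignores how many chapters each group contributed. Weighting by `chapter_count` would be a natural variation if raw overlap turns out to be too coarse.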

