# My submission to the ArangoDb Hackathon
## By Patrick Wendo

The dataset used in this notebook is available on Kaggle [here](https://www.kaggle.com/datasets/andreuvallhernndez/myanimelist)

In [1]:
!pip3 install kagglehub pandas numpy networkx matplotlib nx_arangodb scipy
!pip3 install --upgrade langchain langchain-community langchain-openai langgraph

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [2]:
from arango import ArangoClient

import networkx as nx
import nx_arangodb as nxadb
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from random import randint
import ast
import re

from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver
from langchain_openai import ChatOpenAI
from langchain_community.graphs import ArangoGraph
from langchain_community.chains.graph_qa.arangodb import ArangoGraphQAChain
from langchain_core.tools import tool


db = ArangoClient(hosts="http://localhost:8529").db(username="root", password="Ptsd314159", verify=True)

print(db)

anime = pd.read_csv(
    "./datasets/myanimelist/anime.csv"
)

manga = pd.read_csv(
    "./datasets/myanimelist/manga.csv"
)

anime.info()


[22:53:08 +0300] [INFO]: NetworkX-cuGraph is unavailable: No module named 'cupy'.


<StandardDatabase _system>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24985 entries, 0 to 24984
Data columns (total 39 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   anime_id          24985 non-null  int64  
 1   title             24985 non-null  object 
 2   type              24927 non-null  object 
 3   score             16050 non-null  float64
 4   scored_by         24985 non-null  int64  
 5   status            24985 non-null  object 
 6   episodes          24438 non-null  float64
 7   start_date        24110 non-null  object 
 8   end_date          22215 non-null  object 
 9   source            21424 non-null  object 
 10  members           24985 non-null  int64  
 11  favorites         24985 non-null  int64  
 12  episode_duration  24387 non-null  object 
 13  total_duration    24162 non-null  object 
 14  rating            24405 non-null  object 
 15  sfw               24985 non-null  bool   
 16  approved     

In [3]:
manga.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64833 entries, 0 to 64832
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   manga_id           64833 non-null  int64  
 1   title              64833 non-null  object 
 2   type               64833 non-null  object 
 3   score              24636 non-null  float64
 4   scored_by          64833 non-null  int64  
 5   status             64833 non-null  object 
 6   volumes            48211 non-null  float64
 7   chapters           46042 non-null  float64
 8   start_date         62950 non-null  object 
 9   end_date           51932 non-null  object 
 10  members            64833 non-null  int64  
 11  favorites          64833 non-null  int64  
 12  sfw                64833 non-null  bool   
 13  approved           64833 non-null  bool   
 14  created_at_before  64833 non-null  object 
 15  updated_at         62678 non-null  object 
 16  real_start_date    629

### Data Exploration

- We have 2 datasets, anime and manga. 
- These are both relational datasets with a single entry having multiple values. We need to figure out how to represent this data as a graph. 

#### Our Strategy.
- We could represent this as an attributed graph with node attributes and edge attributes. An **attributed graph** is one where aside from node labels, or edge labels, a node or an edge will also have additional metadata. For example, in a graph about people, we could have a node with the label "Anna" with attributes `{position: "CEO", start_date: "2019-08-19"}`. 

- The node label would be the name of the anime/manga. We could also have some columns be extracted to nodes of their own. For instance, an anime could fall into multiple genres. We could have each of those as a separate node. Similarly for columns like studios, themes, producers, licensors and demographics.

In [4]:
# Exploration of the anime dataset
pd.set_option("display.max.columns", None)
[anime.shape, manga.shape]

[(24985, 39), (64833, 30)]

- The anime dataset has about **24,985 rows with 39 columns**, while the manga dataset has **64,833 rows and 30 columns**

In [5]:
anime.columns

Index(['anime_id', 'title', 'type', 'score', 'scored_by', 'status', 'episodes',
       'start_date', 'end_date', 'source', 'members', 'favorites',
       'episode_duration', 'total_duration', 'rating', 'sfw', 'approved',
       'created_at', 'updated_at', 'start_year', 'start_season',
       'real_start_date', 'real_end_date', 'broadcast_day', 'broadcast_time',
       'genres', 'themes', 'demographics', 'studios', 'producers', 'licensors',
       'synopsis', 'background', 'main_picture', 'url', 'trailer_url',
       'title_english', 'title_japanese', 'title_synonyms'],
      dtype='object')

In [6]:
manga.columns

Index(['manga_id', 'title', 'type', 'score', 'scored_by', 'status', 'volumes',
       'chapters', 'start_date', 'end_date', 'members', 'favorites', 'sfw',
       'approved', 'created_at_before', 'updated_at', 'real_start_date',
       'real_end_date', 'genres', 'themes', 'demographics', 'authors',
       'serializations', 'synopsis', 'background', 'main_picture', 'url',
       'title_english', 'title_japanese', 'title_synonyms'],
      dtype='object')

##### Data Fix #1
- Fix string that look like lists, `"[list, item]" -> ["list", "item"]`. This will allow us to use the #explode() function later when creating data in NetworkX

In [7]:
anime_columns_to_fix = ["genres", "themes", "demographics", "title_synonyms", "studios"]
manga_columns_to_fix = ["genres", "themes", "demographics", "authors"]

def rewrite_anime(col):
    anime[col] = anime[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

def rewrite_manga(col):
    manga[col] = manga[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

for col in anime_columns_to_fix:
    rewrite_anime(col)

for col in manga_columns_to_fix:
    rewrite_manga(col)

##### Data Fix #2
- because the anime, manga, genres, themes, etc etc will use an ID of some kind, we update the anime ID to be prefixed with `a_<id>`

In [8]:
def update_anime_id(anime_df):
    anime_df["anime_id"] = anime_df["anime_id"].astype(str).apply(lambda x: f"{x}") + "_" + anime_df["title"]
    return anime_df  # Ensure the function returns the updated DataFrame


anime = update_anime_id(anime)
anime.head(1)["anime_id"]

0    5114_Fullmetal Alchemist: Brotherhood
Name: anime_id, dtype: object

In [9]:
## Identify Key groups in the following columns: "genres", "themes", "demographics", "studios". These could be shared between anime entries.

genre_explode = anime.explode('genres')
genre_grouped_data = genre_explode.groupby("genres")
anime_genres = genre_grouped_data.groups.keys()

theme_explode = anime.explode('themes')
theme_grouped_data = theme_explode.groupby("themes")
anime_themes = theme_grouped_data.groups.keys()

demographics_explode = anime.explode('demographics')
demographic_grouped_data = demographics_explode.groupby("demographics")
anime_demographics = demographic_grouped_data.groups.keys()

studios_explode = anime.explode('studios')
studio_grouped_data = studios_explode.groupby("studios")
anime_studios = studio_grouped_data.groups.keys()


##### Dealing with the authors column in the manga dataset.
- The author field in Manga datasets should be extracted into it's own dataset. Each author will be a node on it's own. They may be shared across datasets. The node label will be the id with attributes being first name, last name and role. 

In [10]:
manga_authors = manga["authors"]
manga_authors.explode().head()[0]

0    {'id': 1868, 'first_name': 'Kentarou', 'last_n...
0    {'id': 49592, 'first_name': '', 'last_name': '...
Name: authors, dtype: object

##### Creating new DataFrames for NetworkX compatibility

- The node data needs to be in a form like 
```
    {
        node: <node-name>,
        attribute_1: <attribute_1>,
        attribute_2: <attribute_2>,
        attribute_3: <attribute_3>,

    }
```

- Further, we need to define edge data. This would be an edge list of the form:
```
    {
        source: <source>,
        target: <target>,
        attribute_1: <attribute_1>,
        attribute_2: <attribute_2>,
        attribute_3: <attribute_3>,
    }
```

In [35]:
# Building dataframe for "genres", "themes", "demographics", "studios" to be used in building nodes. 

demographics_df = pd.DataFrame(anime_demographics, columns=['demographics']).reset_index()
genres_df = pd.DataFrame(anime_genres, columns=['genres']).reset_index()
themes_df = pd.DataFrame(anime_themes, columns=["themes"]).reset_index()
studios_df = pd.DataFrame(anime_studios, columns=['studios']).reset_index()

list_of_dfs = [demographics_df, genres_df, themes_df, studios_df]

def update_indices(list_dfs):
    current_idx = 0
    for i in range(len(list_dfs)):  
        list_dfs[i] = list_dfs[i].reset_index(drop=True) 
        list_dfs[i]['_key'] = range(current_idx, current_idx + len(list_dfs[i])) 
        list_dfs[i].index = range(current_idx, current_idx + len(list_dfs[i]))
        current_idx += len(list_dfs[i])

    return list_dfs  # Optional: return updated list


[demographics_df_idx, genres_df_idx, themes_df_idx, studios_df_idx] = update_indices(list_of_dfs)
print(genres_df_idx)

    index         genres  _key
5       0         Action     5
6       1      Adventure     6
7       2    Avant Garde     7
8       3  Award Winning     8
9       4      Boys Love     9
10      5         Comedy    10
11      6          Drama    11
12      7          Ecchi    12
13      8        Erotica    13
14      9        Fantasy    14
15     10     Girls Love    15
16     11        Gourmet    16
17     12         Hentai    17
18     13         Horror    18
19     14        Mystery    19
20     15        Romance    20
21     16         Sci-Fi    21
22     17  Slice of Life    22
23     18         Sports    23
24     19   Supernatural    24
25     20       Suspense    25


##### Making the dataframe that can be loaded into networkX

The dataframes `demographics_df_idx, genres_df_idx, themes_df_idx, studios_df_idx` all have unique indices. These will be used in defining an edge list for the anime dataset. For now we work on creating the node list for the anime. 



In [12]:
# anime node attributes
node_label = "anime_id"
node_attributes = {
    "id": "anime_id",
    "name": "title",
    "type": "type", 
    "score": "score", 
    "status": "status",
    "start_date": "real_start_date", 
    "end_date": "real_end_date", 
    "source": "source", 
    "episode_duration": "episode_duration", 
    "total_duration": "total_duration",
    "sfw": "sfw", 
    "start_year": "start_year", 
    "start_season": "start_season", 
    "broadcast_day": "broadcast_day",
    "main_picture": "main_picture",
    "url": "url",
    "trailer_url": "trailer_url",
    "title_english": "title_english",
    "title_japanese": "title_japanese",
    "title_synonyms": "title_synonyms" 
}

(node_attributes.values())
attributed_anime = anime[anime.columns.intersection(list(node_attributes.values()))]


In [13]:
## Building the edge list. 

def build_edge_list(dataframe1, dataframe2, merge_field):
    exploded_df = dataframe1.explode(merge_field)
    edge_list_df = exploded_df.merge(dataframe2, left_on=merge_field, right_on=merge_field)
    edge_list_df = edge_list_df[['anime_id', 'index']]
    return edge_list_df

demographics_edge_list = build_edge_list(anime, demographics_df_idx, "demographics")
genres_edge_list = build_edge_list(anime, genres_df_idx, "genres")
themes_edge_list = build_edge_list(anime, themes_df_idx, "themes")
studios_edge_list = build_edge_list(anime, studios_df_idx, "studios")


In [14]:
anime.loc[anime['title'] == 'Cowboy Bebop']

Unnamed: 0,anime_id,title,type,score,scored_by,status,episodes,start_date,end_date,source,members,favorites,episode_duration,total_duration,rating,sfw,approved,created_at,updated_at,start_year,start_season,real_start_date,real_end_date,broadcast_day,broadcast_time,genres,themes,demographics,studios,producers,licensors,synopsis,background,main_picture,url,trailer_url,title_english,title_japanese,title_synonyms
17,1_Cowboy Bebop,Cowboy Bebop,tv,8.75,923377,finished_airing,26.0,1998-04-03,1999-04-24,original,1788584,79192,0 days 00:24:00,0 days 10:24:00,r,True,True,2005-06-30 05:01:56+00:00,2023-04-02 18:20:37+00:00,1998.0,spring,1998-04-03,1999-04-24,saturday,01:00:00,"[Action, Award Winning, Sci-Fi]","[Adult Cast, Space]",[],[Sunrise],['Bandai Visual'],"['Funimation', 'Bandai Entertainment']","Crime is timeless. By the year 2071, humanity ...",When Cowboy Bebop first aired in spring of 199...,https://cdn.myanimelist.net/images/anime/4/196...,https://myanimelist.net/anime/1/Cowboy_Bebop,https://www.youtube.com/watch?v=qig4KOK2R2g,Cowboy Bebop,カウボーイビバップ,[]


#### NetworkX

- Playing around with NetworkX

In [36]:

# Adding nodes for "genres", "themes", "demographics", "studios"
G = nx.Graph()

G.add_nodes_from(demographics_df_idx.set_index('_key').to_dict(orient='index').items())
G.add_nodes_from(genres_df_idx.set_index('_key').to_dict(orient='index').items())
G.add_nodes_from(themes_df_idx.set_index('_key').to_dict(orient='index').items())
G.add_nodes_from(studios_df_idx.set_index('_key').to_dict(orient='index').items())
G.add_nodes_from(attributed_anime.set_index('anime_id').to_dict(orient='index').items())
genres_edges = list(genres_edge_list.itertuples(index=False, name=None))
demographics_edges = list(demographics_edge_list.itertuples(index=False, name=None))
studios_edges = list(studios_edge_list.itertuples(index=False, name=None))
theme_edges = list(themes_edge_list.itertuples(index=False, name=None))
G.add_edges_from(theme_edges)
G.add_edges_from(genres_edges)
G.add_edges_from(demographics_edges)
G.add_edges_from(studios_edges)
G.nodes
# print(studios_attributes.items())


NodeView((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 

In [16]:
# plot_options = {"node_size": 7, "with_labels": False, "width": 0.15}
# pos = nx.spring_layout(G, iterations=5, seed=1019)
# fig, ax = plt.subplots(figsize=(25, 15))
# nx.draw_networkx(G, pos=pos, ax=ax, **plot_options)

### Persisting The Data in ArangoDB



In [37]:
G_adb = nxadb.Graph(
    name="Anime2",
    db=db,
    incoming_graph_data=G,
    write_batch_size=50000,
    overwrite_graph=True
)


[23:04:28 +0300] [INFO]: Overwriting graph 'Anime2'
[23:04:28 +0300] [INFO]: Graph 'Anime2' exists.
[23:04:28 +0300] [INFO]: Default node type set to 'Anime2_node'
[2025/02/13 23:04:28 +0300] [281347] [INFO] - adbnx_adapter: Instantiated ADBNX_Adapter with database '_system'


[2025/02/13 23:04:29 +0300] [281347] [INFO] - adbnx_adapter: Created ArangoDB 'Anime2' Graph


In [38]:
# nodes_collection = db.collection('Anime2_node')
# # print(nodes_collection)

# for node, data in G.nodes(data=True):
#     print(data)

# for node, data in G.nodes(data=True):
#     key = node if isinstance(node, str) else str(node)  # Only convert if not already a string

#     nodes_collection.insert({
#         "_key": key,  # Use the checked key
#         **data
#     }, overwrite=True)

print(G_adb)

Graph named 'Anime2' with 0 nodes and 83360 edges
